ABSTRACT

Validation of a Diagnostic Interpretation Technique 
for The Iowa Tests of Basic Skills 



Building upon the body of literature recommending the diagnostic 
use and interpretation of standardized achievement tests, this project 
focused on three studies related to group interpretation of the
subskills tested by the Iowa Tests of Basic Skills (Hieronymus, Lindquist,
& Hoover, 1978). The interpretation technique employed emphasized
providing feedback to students and teachers about performance on each
of 60 to 70 skills tested at the various levels of the standardized
achievement test. 

Study 1 was a study of the impact of the interpretation sessions 
on teachers and students. Both attitudes toward standardized tests and 
knowledge about the Iowa Tests of Basic Skills were assessed for experi- 
mental and control groups of teachers and students. Study 2 addressed 
the commonly held belief that teachers have fairly accurate perceptions 
of "how well students are doing" in their skills development. This 
study also explored the differences in estimations made under raw-score
and norm-referenced frameworks, and the effects of grade level,
student sex, and overall mathematics achievement on the predictions.
Study 3 was essentially a concurrent validity study between the Iowa
Tests of Basic Skills and the Stanford Diagnostic Mathematics Test.
Since the interpretation technique employed constituted a diagnostic 
use of the survey test, this study was included to address the question 
of whether it measured the same things in the same ways as the diagnostic 
test. 

The findings of the three studies incorporated into this project 
led to a conclusion that there is a need for the interpretation of 
the results of tests administered in school-wide testing programs, and
there was modest support for providing "diagnostic" interpretation of
the "survey" test. At least two bases for this conclusion were found.
First, students who had been through the interpretation process felt
that they had done better on the test than students who had not
had the test results interpreted to them. Second, the act of
interpretation should raise important questions for the teachers, as
discrepancies between expectations and actual performance occur. This
should benefit both students and teachers as reasons for the discrepancies
between the students' behaviors and the teachers' expectations are
explained. The benefit for teachers should be an opportunity to:
1) reassess their expectations for certain students; and 2) examine some
of their biases about the performance of certain subgroups in the subject
areas tested. The benefit for students should be a better educational
process borne out of higher expectations for themselves and more
appropriate expectations from their teachers, regardless of the student's
sex or overall achievement level.



VALIDATION OF A DIAGNOSTIC INTERPRETATION TECHNIQUE 
FOR THE IOWA TESTS OF BASIC SKILLS 



PART I: INTRODUCTION 

In 1972, the American Personnel and Guidance Association and the 
National Council on Measurement in Education adopted a joint resolution 
on the responsible use of tests. In part, their position statement
reads:

    In schools and colleges the principal needs served
    by testing include the providing of information
    (1) to teachers as an aid to the improvement of
    instruction; (2) to students and, in the case of
    younger students, to their parents, as an aid to
    self-understanding and to both educational and
    vocational planning; and (3) to administrators,
    as a basis for planning, decision-making, and
    evaluating the effectiveness of programs and
    operations. (American Personnel and Guidance
    Association & National Council on Measurement in
    Education, 1972.)

Further, in 1980, the American Personnel and Guidance Association 
issued a policy statement titled "Responsibilities of Users of
Standardized Tests" (American Personnel and Guidance Association, 1980).
This policy statement emphasized the importance of the test user 
becoming familiar with the test, and the need for presentation of the 
test data so that it is comprehensible to the user. 

The importance of comprehensible feedback to the person tested has
been previously recognized as a condition of the ethical use of tests
(Lyman, 1974), and as an important aspect of student motivation
(Kirkland, 1971). Feedback is further recognized as one means of
meeting test consumers' needs (Bradley, 1981), and as the core of
Bradley's (1978) person-referenced test interpretation. Through
person-referenced interpretation, items from a test are reviewed by the
student and a counselor in an attempt to personalize the results for
the student and to assist the student in processing the information
gleaned through the interpretation. Bradley (1978) contends that it
is difficult to personalize performance and fully promote
self-understanding using normative scores alone. Buros (1977), in the
same vein, suggested that the recording of normed scores alone be
replaced by compound scores consisting of the normed scores, the
percentage of items estimated to be known, and, possibly, the obtained
percentile rank. The intent of the compound score is, according to
Buros, "to shift our emphasis from differentiation to measurement."

When moves away from the use of normative scores are made, the use of
the test results for diagnostic purposes follows logically.
At the heart of person-referenced test interpretation is an intention
to allow the student to analyze correct and incorrect responses to
individual items on the test and to respond to that analysis in a
personal way. Presumably, a similar outcome would result if the focus
of testing moved more toward measurement than differentiation.
Diagnostic interpretation of tests has been encouraged by test
publishers, through various raw-score and item-response report forms for
their standardized survey instruments. The claims made for these
report forms range from providing clues for selective follow-up
(Hieronymus, Lindquist, & Hoover, 1979, p. 31), to helping point out
a student's relative strengths and weaknesses within a specific skill
domain (Prescott, Balow, Hogan, & Farr, 1978, p. 33), to determining
an individual's strengths and weaknesses in the various categories of
skills tested (CTB/McGraw-Hill, 1977, p. 65).

In general, the interpretation of item data and/or skill clusters
for diagnostic uses has been widely recommended. Ebel (1972) suggests
that "many achievement tests can provide 'diagnostic' information of
value to the individual pupil if he is told which items he missed"
(p. 478), and Rudman (1977) indicates that through scoring options
available from test publishers, "classically constructed standardized
achievement tests can be used analytically, for they can be referenced
in one of several modes: by norms, by criteria, by objectives" (p. 181).

Building upon the body of literature recommending the diagnostic 
use and interpretation of standardized achievement tests, this project 
focused on three studies related to group interpretation of the
subskills tested by the Iowa Tests of Basic Skills (Hieronymus, Lindquist,
& Hoover, 1978). The interpretation technique employed emphasized
providing feedback to students and teachers about performance on each
of 60 to 70 skills tested at the various levels of the standardized
achievement test. This interpretation technique is described in
Part II: The Interpretation Technique. 

Study 1, which is described in Part III: The Impact Study, was
a study of the impact of the interpretation sessions on teachers and
students. Both attitudes toward standardized tests and knowledge
about the Iowa Tests of Basic Skills were assessed for experimental
and control groups of teachers and students.

Study 2, covered in Part IV: Teachers' Predictions of Student
Performance on Subskills of Mathematics, addressed the commonly held
belief that teachers have fairly accurate perceptions of "how well
students are doing" in their skills development. This study also
explored the differences in estimations made under raw-score and
norm-referenced frameworks, and the effects of grade level, student
sex, and overall mathematics achievement on the predictions.

Study 3, presented in Part V: Relationships Between the Results
of the Iowa Tests of Basic Skills, Mathematics Subtests, and the
Stanford Diagnostic Mathematics Test, was essentially a concurrent
validity study. Since the interpretation technique employed
constituted a diagnostic use of the survey test, this study was included to
address the question of whether it measured the same things in the
same ways as the diagnostic test.

Finally, in Part VI: Summary of the Project and Conclusions, a
discussion of the validity and usefulness of the interpretation
technique is provided. In this discussion the pertinent findings of the
three studies are integrated to highlight the uses of the technique
and to point out the weaknesses and some cautions that should be
observed. These considerations indicate some specific needs for
further research on the process and include speculation about some
aspects of the interpretation process which were not studied.

PART II: THE INTERPRETATION TECHNIQUE

The approach described here was conducted in classroom-sized
groups, and required approximately 40 to 50 minutes per group. It
was a time-efficient way to provide feedback to large numbers of
students, thus meeting the professional responsibility of interpretation
and freeing valuable teacher or counselor time for dealing with students
who need extra help in understanding the test results. The
interpretation process involved several steps, leading students to a summary of
their own performance on each of approximately 60 skills identified on
the Iowa Tests of Basic Skills. The students used their own scoring
service report form (the Pupil Item Response Record) and a Skill Summary
Sheet, which was constructed for the project.

Description of the Pupil Item Response Record

The Pupil Item Response Record provides complete information on
each pupil's answer for each question on the test. (Identifying
information about the student, grade-equivalent (or other developmental)
scores, and percentile ranks are given on the report.) In addition,
the percentage of correct and incorrect responses for each subtest,
plus the item number, the student's response to each item (correct,
incorrect, or omit), the difficulty level of the item, and the skill
measured by the item are provided. Figure 1 shows a sample segment of
the report for one of the eleven subtests on the Iowa Tests of Basic
Skills main battery. A "+" indicates a correct answer, a "*"
indicates an incorrect response, and an "0" represents a question left
unanswered. The item numbers are read vertically and are out of
sequence, because the scoring program clusters items for the same
skill together. The difficulty scale runs from 1 to 9, with 1 being
the most difficult (10-19 percent of students in the standardization
sample selecting a correct response).

The skill codes are as follows: 

1 = Single-step problems: addition or subtraction.

2 = Single-step problems: multiplication or division.

3 = Multiple-step problems: combined use of basic operations.

The secondary skill codes, C, W, and F, represent currency, whole numbers,
and fractions, respectively. The secondary codes are not used in the
technique presented here.



Figure 1. Sample Section of a Pupil Item Response Record for Level 11,
Form 8 of the Iowa Tests of Basic Skills.

Test M-2: Mathematics Problems

GE = 52   PR = 50   % Correct = 52   % Incorrect = 37   NA = 11

item number   1 1 222333   2223334    1 34 2 222 3 3344
              78139027     025349 1   980467815623
response
difficulty    77767766     8555553    833864645655
skill         11111111     2222222    333333333333
              CCCCWWFF     CCCWWWF    CCCWWWWWWWWW



Description of the Skill Summary Sheet

The Skill Summary Sheet is a listing of the major skills tested
on the Iowa Tests of Basic Skills. For each major skill tested, three
broad categories of performance were defined: 1) Satisfactory progress
likely; 2) More information needed; and 3) Possible problem area. The
three categories were divided on the sheet according to raw score
performance, which was adjusted for differences in the mean difficulty
of the set of items used to test each skill.

In determining the raw score ranges for the three categories of
performance, the "Satisfactory progress likely" column included raw
scores generally at or above the mean raw score performance of the
students in the norm group. The division between the remaining two
categories was determined by allocating the items on a percentage basis,
with the dividing point set at approximately half the difference between
the mean percent correct and zero.

For example, a skill tested with 10 items and having a mean
difficulty of 50 percent would have raw scores of 5 through 10 in the
"Satisfactory progress likely" column, scores of 2, 3, and 4 in the
"More information needed" column, and scores of 0 and 1 in the "Possible
problem area" column. Another skill, also tested with 10 items but
having a mean difficulty of 70 percent, would have only scores of 7
through 10 in the "Satisfactory progress likely" column, with 3 through
6 in "More information needed," and 0, 1, and 2 in "Possible problem
area."

On skills tested with small numbers of items, some adjustments in
the above approach were made. These adjustments were necessary in
order to avoid such absurdities as having zero scores in the
"Satisfactory progress likely" column. Where only one item appeared for a
skill, both scores of 0 and 1 were included in "More information needed."
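
The allocation rule can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure actually used to build the sheet: the function name `skill_score_ranges` is hypothetical, the "Satisfactory" cutoff is assumed to be the mean raw score rounded to a whole item, and the lower cutoff is assumed to be half the mean raw score rounded down. Both assumptions were chosen only because they reproduce the two worked examples above. Of the small-item adjustments, only the one-item case stated in the text is implemented.

```python
def skill_score_ranges(n_items, mean_pct_correct):
    """Split the raw scores 0..n_items into the three sheet categories.

    Assumed rule (inferred from the worked examples, not stated exactly
    in the report): scores at or above the mean raw score are
    "satisfactory"; scores below half the mean raw score (rounded down)
    are "possible problem area"; the rest need more information.
    mean_pct_correct is an integer percentage such as 50 or 70.
    """
    scores = list(range(n_items + 1))
    if n_items == 1:
        # Stated in the text: with one item, both 0 and 1 fall in
        # "More information needed".
        return {
            "possible_problem_area": [],
            "more_information_needed": scores,
            "satisfactory_progress_likely": [],
        }
    satisfactory_cut = round(n_items * mean_pct_correct / 100)
    lower_cut = (n_items * mean_pct_correct) // 200
    return {
        "possible_problem_area": [s for s in scores if s < lower_cut],
        "more_information_needed": [
            s for s in scores if lower_cut <= s < satisfactory_cut
        ],
        "satisfactory_progress_likely": [
            s for s in scores if s >= satisfactory_cut
        ],
    }


if __name__ == "__main__":
    # The two examples from the text: 10 items at 50 and 70 percent.
    for pct in (50, 70):
        print(pct, skill_score_ranges(10, pct))
```

Working in integer percentages keeps the half-mean cutoff free of floating-point rounding surprises.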

Figure 2 provides an example of a portion of the Skill Summary 
Sheet. This sample can be used with Figure 1 to complete the process 
described in the following section. 



Figure 2. Sample Section of the Skill Summary Sheet for Level 11,
Form 8 of the Iowa Tests of Basic Skills. (Numerals
represent possible raw scores.)

Test M-2: Mathematics Problem Solving

                                Possible    More           Satisfactory
                                Problem     Information    Progress
                                Area        Needed         Likely

1: single-step problems         0 1 2       3 4 5          6 (7) 8
   addition/subtraction

2: single-step problems         0           1 2 3          4 (5) 6 7
   multiplication/division

3: multiple-step problems       0 1 (2)     3 4 5          6 7 8 9 10 11 12
   combined use of basic
   operations



Description of the Interpretation Process 

Using the two sample forms shown in Figures 1 and 2 together, a
capsule form of the process can be seen. First the student would find
the "Skill" row in Figure 1 and see that the first eight questions are
coded for the major skill code "1" (one-step problems: addition and
subtraction). Next the student would count the pluses in the "Response"
row for the items in this group (7 correct responses). Then, moving to
Figure 2, the student would circle the 7 in the row of numbers beside
skill area 1: single-step problems, addition and subtraction. This
process would be repeated for skill areas 2 and 3 in the same manner,
resulting in circles being drawn around the 5 in the middle row and
around the 2 in the third row of Figure 2.

Since the student scored well into the "Satisfactory progress
likely" column in skills 1 and 2 and somewhat lower in skill 3, it could
be hypothesized that the student solves one-step problems more easily than
multiple-step problems. Since all four of the basic operations are
involved in the first two skills, it could be further hypothesized that
it is sorting out the elements of the multiple-step problems that is
causing the difficulty in skill area 3 rather than computational errors
alone.
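
The counting and circling steps a student carries out by hand can be mirrored in a short Python sketch. Everything here is illustrative: the function names are hypothetical, and the response row is invented, since only the per-skill counts of correct answers (7, 5, and 2) come from the text. The invented row was chosen so that it also agrees with the 52 percent correct, 37 percent incorrect, and 11 percent not answered shown in the Figure 1 header. The category boundaries are read from Figure 2; the lower bound of the "Satisfactory" column for skill 2 is inferred from the seven items in that skill.

```python
# Major skill code for each of the 27 items in Figure 1 (8, 7, and 12 items).
SKILL_CODES = "1" * 8 + "2" * 7 + "3" * 12

# Hypothetical response row: "+" correct, "*" incorrect, "0" omitted.
# Only the number of "+" marks per skill is taken from the text.
RESPONSES = "+++++++*" + "+++++**" + "++*******000"

# Per skill: (lowest "More information needed" score,
#             lowest "Satisfactory progress likely" score), from Figure 2.
FIGURE_2_CUTS = {"1": (3, 6), "2": (1, 4), "3": (3, 6)}


def count_correct(responses, skill_codes):
    """Count the "+" marks for each major skill code."""
    counts = {skill: 0 for skill in sorted(set(skill_codes))}
    for response, skill in zip(responses, skill_codes):
        if response == "+":
            counts[skill] += 1
    return counts


def categorize(score, cuts):
    """Return the Skill Summary Sheet column in which a raw score falls."""
    more_info_min, satisfactory_min = cuts
    if score >= satisfactory_min:
        return "Satisfactory progress likely"
    if score >= more_info_min:
        return "More information needed"
    return "Possible problem area"


if __name__ == "__main__":
    for skill, score in count_correct(RESPONSES, SKILL_CODES).items():
        print(skill, score, categorize(score, FIGURE_2_CUTS[skill]))
```

With these inputs, skills 1 and 2 fall in "Satisfactory progress likely" and skill 3 falls in "Possible problem area", which matches the circles described for Figure 2.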

Typical Time Allocations and an Outline for Conducting the Interpretations
in Fourth, Fifth, and Sixth Grades

A typical pattern of the events and time allocations for conducting 
the interpretation in grades 4, 5, and 6 is given in Figure 3. These 
patterns were established in relatively small classes (less than 25
students) of about average overall ability. The time allocations need 
to be adjusted for accelerated or slower groups, but the overall change 
needed generally is not more than a few minutes. 

Figure 3. Time Use in the Interpretation Process.

[Chart: for each of the fourth, fifth, and sixth grades, a column on a
0-to-50-minute scale divides the session into the following segments:
hand out materials; introduction; explain Pupil Item Response Record;
explain Skill Summary Sheets; guided practice; students complete Skill
Summary Sheets with monitoring; interpretation guidance for the
completed Skill Summary Sheets; summary and general questions.]

The interpretation process followed the format given below in
outline form:

I. Introduction.

A. Description of the test and reminder to the students of
what the questions were like.

B. What the test measures and does not measure.

C. Reasons for taking tests.

D. Feelings that people have about taking tests.

E. Importance of finding out what the test results mean.

II. Reading the Pupil Item Response Record.

A. General organization of the sheet.

B. Through examples, teach students to read each row of
information contained on the report.

C. Relate the data presented on the report back to what it
represents in terms of test questions and skills tested.

III. Using the Skill Summary Sheet to summarize performance.



A. Identify skill codes and model, through examples, how to
match up skill codes on the Skill Summary Sheet to those
on the report form.

B. Instruct students to count the number of correct answers
(+'s) for each skill area tested and to circle the
corresponding number on the Skill Summary Sheet.

IV. Students complete their Skill Summary Sheets. 

A. Monitors should circulate through the group during this 
time, providing assistance for students who have problems 
and spot checking to be sure the students understand the 
process. 

V. How to interpret and use the Skill Summary Sheets.

A. Define the three categories of performance.

1. Satisfactory progress likely: chances are good that
the student has developed a working level of the
skill tested.

2. Possible problem area: chances are good that lessons
requiring this skill will be difficult for the
student.

3. More information needed: performance was neither high
enough nor low enough to make a well-founded guess
about the development of the skill.

B. Along with the definition of the performance categories,
encourage students to check with their teachers to be sure
that the skill tested matches the curriculum sequence of
the school. Some skills may be tested at earlier levels
than they are taught in a particular curriculum. Low
performance on those skills may be anticipated and, thus,
should not be a source of undue concern for the student or
teacher.

C. Look for skill areas within a test that deviate markedly
from the general pattern of scores.

D. Caution students that high performance does not necessarily
mean the skill has been mastered and that low performance
does not mean that the student knows nothing about the skill.

VI. Summarize the process and encourage the students to discuss the 
results with their teachers and their parents. 



PART III: THE IMPACT STUDY 



As noted in the Introduction to this report, there is a substantial 
body of literature related to the need for providing feedback to students 
on the outcomes of tests they have taken. There are also a number of
approaches to interpretation to be found in recent literature
(e.g., Bohning, 1979; Bradley, 1978; Cummings, 1981; Rudman, 1977).
However, empirical evidence of the impact of these interpretation
approaches on either teachers or students is lacking.

The major concern of this study was to determine whether the 
interpretation technique, reported by Cummings (1981) and described 
in Part II of this report, had any impact on student or teacher 
attitudes or knowledge about the standardized achievement test in 
use in their schools. The results of this study, in conjunction with 
the results of the two complementary studies presented in Parts IV 
and V of the report, constitute an initial attempt to validate this 
interpretation technique. 

METHODS 

Participants 

Sampling was done by school building. Six building principals 
were contacted, and they agreed to participate in the impact study. 
The buildings were located in five public and one parochial district 
in eastern Iowa. These small city and rural districts ranged in
total enrollments from 106 to 3,316 students. All fourth-, fifth-,
and sixth-grade teachers and students in the buildings participated.
Fall administration of the Iowa Tests of Basic Skills was part of their
regular testing program. Because of the voluntary nature of the
sampling it would be fair to assume that the building principals were
positively inclined toward standardized testing and test interpretation.
To the degree that principals' attitudes are related to teachers'
attitudes, the teachers would have been somewhat positive toward
testing as well. The same reasoning applies to students.

This possible bias did not, however, affect the outcomes of the
impact study because the classrooms within each building were randomly
assigned to control and experimental groups. During the period from
October 15, 1980, through January 25, 1981, all students received the
test interpretation session with their teachers in attendance. The
impact of the interpretation session was measured with a teacher
questionnaire and a student questionnaire. The control teachers and
students responded to their questionnaires just prior to the session,
and the experimental teachers and students responded just after the
session. Table 1 presents the sample sizes. If one divides the
number of students by the number of teachers in each cell of the table,
disparate class sizes will be observed. There are two reasons for
this. First, in one school district fifth and sixth graders were in
combined classes; the students indicated their grade level on the
questionnaire and were included in the impact study. However, the
sixth-grade teachers were not included in the analysis. Second,
several teachers invited guest teachers to attend the sessions and
these teachers, in turn, responded to the teacher questionnaire and
were included in the study.

Table 1. Participants in Study.

                   Grade 4             Grade 5             Grade 6              Total
               Teachers Students   Teachers Students   Teachers Students   Teachers Students

Control            5      112         10      292          9      252         24      656
Experimental       7       97         13      221          6      133         26      451

Total             12      209         23      513         15      385         50    1,107
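
The division of students by teachers mentioned in the text can be carried out directly on the Table 1 counts. The sketch below is illustrative only; the dictionary layout and function names are assumptions, while the counts themselves are taken from Table 1.

```python
# (teachers, students) for each cell of Table 1.
TABLE_1 = {
    "Control": {"Grade 4": (5, 112), "Grade 5": (10, 292), "Grade 6": (9, 252)},
    "Experimental": {"Grade 4": (7, 97), "Grade 5": (13, 221), "Grade 6": (6, 133)},
}


def students_per_teacher(table):
    """Return the student-to-teacher ratio for every cell, to one decimal."""
    return {
        group: {
            grade: round(students / teachers, 1)
            for grade, (teachers, students) in cells.items()
        }
        for group, cells in table.items()
    }


def group_totals(table):
    """Return (total teachers, total students) for each treatment group."""
    return {
        group: (
            sum(teachers for teachers, _ in cells.values()),
            sum(students for _, students in cells.values()),
        )
        for group, cells in table.items()
    }


if __name__ == "__main__":
    print(students_per_teacher(TABLE_1))
    print(group_totals(TABLE_1))
```

The ratios run from roughly 14 students per teacher (experimental, grade 4) to roughly 29 (control, grade 5), which is the disparity the two reasons given in the text are meant to explain.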



Instruments 

The teacher questionnaire. Twenty-one of the 35 items on the
teacher questionnaire addressed teacher attitudes about the Iowa Tests
of Basic Skills (ITBS), and 14 items measured knowledge of the ITBS
(see Appendix A). The first 13 items asked teachers about their
perceptions of the value of uses of ITBS results. On the basis of
item content, five subscales were constructed with "extremely valuable"
coded as 6, "very valuable" coded as 5, and so on to "not valuable"
coded as 1. The items selected for each scale are listed below:

a) Reporting to others (mean of items 1, 2, and 3);

b) Use at the individual student level (mean of items 5, 7, 10,
and 12);

c) Use at the student group level (mean of items 6, 8, 9, 11,
and 13);

d) Instructional purposes (items 5, 6, 7, 10, 11, 12, and 13);

e) Administrative purposes (items 1, 2, 8, 9, and 11).

A utility scale was formed from items 14, 15, 16, 18, and 19 by
coding the most negative response as 1 and the most positive response
as 5 and computing the mean. This scale overlaps the content of
several value scales, but the utility items were not included on the
value scales because the response formats were different. Items 17,
20, and 21 were analyzed separately.

A 14-item knowledge scale was formed from items 22 to 35. Both
multiple-choice and true-false questions were included, and the correct
answers are indicated on the questionnaire in Appendix A. Most of the
items asked about uses and interpretation of ITBS results; three items
asked about interpretation of subskill results in general (items 25,
32, and 33). The median item difficulty (proportion of correct answers)
was .79, and the median corrected item-to-total score correlation was
.25. Coefficient alpha was .54, which is respectable for a 14-item test
that did not purport to measure a unitary trait.
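
For readers unfamiliar with the three item statistics just reported, the following Python sketch shows how they are conventionally computed. It assumes the standard textbook definitions (item difficulty as the proportion answering correctly, the corrected item-to-total correlation as the Pearson correlation between an item and the sum of the remaining items, and coefficient alpha from item and total-score variances); the report does not spell out its formulas. The function names and the small response matrix are hypothetical and are not data from this study.

```python
from statistics import mean, median, pvariance


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def item_difficulties(rows):
    """Proportion of respondents answering each item correctly."""
    n_items = len(rows[0])
    return [mean(row[i] for row in rows) for i in range(n_items)]


def corrected_item_total(rows):
    """Correlate each item with the total score of the other items."""
    n_items = len(rows[0])
    result = []
    for i in range(n_items):
        item = [row[i] for row in rows]
        rest = [sum(row) - row[i] for row in rows]
        result.append(pearson(item, rest))
    return result


def coefficient_alpha(rows):
    """Coefficient alpha from item variances and total-score variance."""
    k = len(rows[0])
    item_vars = [pvariance([row[i] for row in rows]) for i in range(k)]
    total_var = pvariance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Hypothetical 0/1 scores: 6 respondents by 4 items.
    scores = [
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 1],
        [0, 0, 1, 0],
        [1, 1, 1, 1],
    ]
    print("median difficulty:", median(item_difficulties(scores)))
    print("median corrected r:", median(corrected_item_total(scores)))
    print("alpha:", coefficient_alpha(scores))
```

Because alpha depends on how strongly the items covary, a scale that deliberately samples several loosely related topics, as this knowledge scale did, would be expected to yield a modest value.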

In addition to the questionnaire, teachers were asked to complete 
a short evaluation form about the interpretation session. Both the
experimental and control teachers completed the form after the session. 
The questions included on the form are described in the results section. 

The student questionnaire. The student questionnaire (see
Appendix B) also contained attitude and knowledge questions. Seven
of the ten attitude items addressed attitudes about the ITBS and were
analyzed in the impact study (three of the attitude items asked about 
teacher-made tests and are not discussed). The content of the 
attitude items did not lend itself to the formation of subscales, and 
they were analyzed separately. The most positive response was coded 
as 5 and the least positive response as 1. 

The remaining 14 items assessed student knowledge about achievement 
testing. Nearly all of the items ask about purposes of the ITBS. All 
of the items are true-false, and the correct answers are listed in the 
questionnaire (Appendix B). For fourth-grade students, the knowledge
questions were read aloud to the students, and students marked a
machine-scorable answer sheet to record their responses. In the other
grades, and in all grades for the attitude questions, students read
the questions and marked their responses on the answer sheet. The
median item difficulty was .67 and the median corrected item-to-total
score correlation was .19. The coefficient alpha was .47.

As with teachers, both experimental and control students 
answered evaluation questions about the interpretation session just 
after the session. These questions are described in the results section.

Data Analysis 

The main statistical tool used in data analysis was a 2 x 3
analysis of variance, with two levels of treatment group (experimental
and control) and three levels of grade (fourth, fifth, and sixth
grades). This procedure tested the effect of experimental-control
group membership (the effect of the skill interpretation session) and
the effect of the grade level on the attitude and knowledge scale
scores. Analysis of variance was also used to test effects upon
individual attitude items.

RESULTS 

Impact of Interpretation on Teachers' Attitudes and Knowledge

Table 2 presents the results from the analysis of variance for
teachers' attitudes and knowledge. None of the main effects of
treatment or grade level were significant at the .05 level. One of the
interactions was significant at the .05 level, but further analysis
yielded no worthwhile interpretations. The interpretation of these
data is straightforward: the short-term impact of the interpretation
session on teacher attitude and knowledge was negligible.

Although no short-term impact on teacher attitudes and knowledge
was found, the teachers' evaluations of the
sessions indicated that the interpretation was perceived as worthwhile.



Analysis of variance was performed on single items in spite of ques- 
tions about the equal interval assumption for Likert-type items. 
This approach allowed all analyses to be reported in a consistent 
format. Where responses to individual items served as dependent
variables, chi-square tests were also performed to test for
relationships between treatment group and responses. In all
cases, the chi-square results yielded interpretations which were
equivalent to the results from analysis of variance. 
Table 2. Results of Analysis of Variance for Teacher Attitudes and
Knowledge.

                                    Grade           Treatment       Grade by Treatment
                                    Effect          Effect          Interaction
                                    F      Prob     F      Prob     F            Prob

Perceived value of the ITBS
results for:
  a) Reporting to others            1.56   .22      0.05   .82      0.26         .77
  b) Use at the individual
     student level                  0.66   .52      0.84   .37      2.09         .14
  c) Use at the student
     group level                    0.14   .87      0.46   .50      [illegible]  .20
  d) Instructional purposes         0.27   .77      0.83   .37      3.17         .05
  e) Administrative purposes        0.66   .52      0.07   .80      0.57         [illegible]

Perceived utility of ITBS
results                             0.88   .42      0.93   .34      0.73         .49

Goodness of match between
ITBS and curriculum (Item 17)       2.43   .10      0.07   .79      0.34         .72

Self-rated knowledge of
ITBS (Item 20)                      0.27   .77      0.96   .33      1.35         .27

Overall relative quality
of ITBS (Item 21)                   0.17   .85      1.12   .30      0.60         .55

Knowledge about the ITBS
and interpretation of
results (Items 22-35)               0.47   .63      0.07   .79      0.14         .87

Note: F = sequential F value as grade entered equation first, treatment
      entered second, and the interaction entered last.
      Prob = probability of getting an F value equal to or larger than
      the observed F value under null conditions (where there is no
      effect of treatment or grade level).
      Total sample sizes ranged from 47 to 50 because some of the items
      were omitted by some teachers.

In each of the sessions, the participating teachers responded to the
five questions presented in Table 3. The percent of teachers selecting
each response, for each question, is reported in the column to the left
of the question.

Table 3. Evaluations of the Interpretation Sessions by Teachers.

Percent of Teachers      Evaluation Questions Asked
Selecting Response       and Response Options

                         How difficult do you think the interpretation
                         was for your students?
          0                 1. Too difficult
        100                 2. About right
          0                 3. Too easy

                         How would you rate student interest in the
                         interpretation session?
         67                 1. Very interested
         29                 2. Somewhat interested
          4                 3. Neutral
          0                 4. Somewhat bored
          0                 5. Very bored
          0                 6. Don't know

                         Do you think the interpretation session will
                         positively affect the students' test taking
                         attitudes?
         56                 1. Yes
          4                 2. No
         40                 3. Not sure

                         Do you think that the interpretation session
                         and follow-up on it will result in improved
                         teaching/learning?
         65                 1. Yes
          2                 2. No
         33                 3. Not sure

                         Do you think that this type of interpretation
                         session is worth continuing next year?
         94                 1. Yes
          0                 2. No
          6                 3. Not sure


One common criticism of test interpretation techniques is that
the processes are complex and difficult for students to understand.
The perceptions of the teachers who participated in the interpretation
process under study here indicate that difficulty in understanding was
not a problem at any of the three grade levels involved. In addition,
the teachers felt that student interest in the session was very high.
The questions about the effects of the sessions on students' test
taking attitudes and regarding improved teaching/learning addressed
two long-term goals of the sessions. As might be expected, a sizable
percentage of teachers were unsure about long-term effects. However,
an even larger percentage (around 60 percent) felt that the session
would positively affect students' test taking attitudes and result
in improved teaching/learning. There was an interesting pattern of
difference on the question about students' test taking attitudes;
teachers of older students predicted more positive influence of the
session on test taking attitudes than teachers of younger students.
The percentages of yeses for the question were 33 percent for fourth
grade, 50 percent for fifth grade, and 75 percent for sixth grade.
Responses to the final question assessed overall evaluation, and it is
apparent that the sessions were well received by the teachers.

Impact of Interpretation on Students' Attitudes and Knowledge 

Table 4 presents the results from the analysis of variance for
student attitudes and knowledge. Group means for those variables for
which significant (at the .05 level) effects were obtained are
presented in Table 5.

The main effect of grade and interaction effects were found both
for how well students like the Iowa Tests of Basic Skills and for
how difficult they perceive the tests to be. Fourth- and fifth-grade
students liked the tests about equally well and were significantly
more positive about the tests than sixth graders. In terms of
the difficulty of the tests, fourth-grade students perceived the tests
as most difficult, and fifth-grade students perceived them as easiest.
The difference between fourth- and fifth-grade responses was significant.
The other differences were not significant.



Table 4. Results of Analysis of Variance for Student Attitudes and
Knowledge.

                                Grade             Treatment         Grade by Treatment
                                Effect            Effect            Interaction
                                F        Prob     F        Prob     F        Prob

Self-rated performance on
  ITBS (Item 1)                 0.45     .64      4.84     .03      1.30     .27
Liking for ITBS (Item 3)        7.36     <.01     0.67     .41      8.63     <.01
Difficulty of ITBS (Item 4)     3.12     .05      0.60     .44      3.94     .0[?]
Anxiety toward ITBS (Item 7)    2.30     .10      0.15     .70      0.66     .52
Goodness of match between
  ITBS and curriculum (Item 8)  0.66     .52      0.62     .43      1.48     .23
Self-rated knowledge of
  ITBS (Item 9)                 0.17     .85      0.01     .97      0.65     .52
Personal utility of ITBS
  results (Item 10)             0.94     .39      0.05     .83      1.67     .19
Knowledge of ITBS purposes
  (Items 11-24)                 23.94    <.01     8.76     <.01     0.82     .44

Note: F = sequential F value as grade entered equation first, treatment
      entered second, and the interaction entered last.
      Prob = probability of getting an F value equal to or larger than
      the observed F value under null conditions (where there is no
      effect of treatment or grade level).
      Total sample sizes ranged from 1,104 to 1,107 because not all of
      the students completed all of the items.

Table 5. Group Means for Selected Student Attitude Items and Knowledge.

                              Group          Grade 4   Grade 5   Grade 6   Total

Liking for ITBS (Item 3)      Experimental   2.89      3.21      2.89      3.04
  I really like them = 5      Control        3.49      3.09      2.93      [illeg.]
  I really hate them = 1      Total          [illeg.]  [illeg.]  [illeg.]

Difficulty of ITBS (Item 4)   Experimental   3.50      3.17      3.35      3.30
  Very hard = 5               Control        3.26      3.26      3.26      3.26
  Very easy = 1               Total          [illeg.]  [illeg.]  [illeg.]

Self-rated performance        Experimental   3.50      3.37      3.31      3.38
  on ITBS (Item 1)            Control        3.37      3.31      3.30      3.27
  Quite high = 5              Total          [illeg.]  [illeg.]  [illeg.]
  Quite low = 1

Knowledge of ITBS purposes    Experimental   9.39      9.65      10.24     9.77
  (Items 11-24)               Control        8.73      9.14      10.00     9.40
  number of items correct     Total          9.04      9.36      10.08


The interactions were such that, at the fourth grade, the control
group liked the ITBS more and rated the difficulty lower than the
experimental group. Across all grades, however, the control and
experimental groups were similar in their attitudes about the tests.

There were significant differences between the experimental and
control groups on how well they thought they had done on the ITBS. At
each grade level, the experimental group rated their performance higher
than did the control group. The group means are listed in Table 5, and
Figure 4 illustrates the frequencies for all grades combined.

The experimental group also performed significantly higher on the
14-item knowledge test.¹ Table 5 shows that the experimental group was
higher at each grade level. The main effect of grade was also
significant, with sixth graders scoring higher than fifth graders and
fifth graders higher than fourth graders.

¹Although the students were randomly assigned to treatments, one could
argue that the results of self-rated performance on the ITBS and
the knowledge scale simply indicate that the experimental students
were brighter than the control students. This would cast doubts
on the significant treatment effects. In order to test this
hypothesis, an analysis of covariance was performed. The grade and
treatment effect upon the knowledge scale continued to be significant
after adjusting for possible group differences in self-rated
performance.

Figure 4. Frequency Distributions of the Experimental and Control
Groups for Item 1 ("How well do you think you did on the Iowa
Tests of Basic Skills this year?").

The students also answered questions specifically about the
interpretation session: how interesting the session was, how confusing
it was, and how helpful it might be in future learning. These evaluation
questions were asked of all students just after the session and,
therefore, a breakdown for experimental and control groups was not
appropriate. The total sample size ranged from 1,045 to 1,053, somewhat
less than for previous results because two classes did not
complete the evaluation questions. The results are presented in
Table 6.

Table 6. Evaluations of the Interpretation Sessions by Students.

Percent of Students      Evaluation Questions Asked
Selecting Response       and Response Options

                         Do you think the skill session was interesting?
       54                   1. Yes
       29                   2. No
       17                   3. Not sure

                         Do you think the skill session was confusing?
       20                   1. Yes
       55                   2. No
       25                   3. Not sure

                         Do you think that knowing your strong and weak
                         areas will help you learn better?
       77                   1. Yes
        7                   2. No
       16                   3. Not sure



The students viewed the session as interesting, but their overall
level of interest was not as high as their teachers perceived it. The
grades did not significantly differ in reported interest level.² Although
100 percent of the teachers rated the difficulty of the session to be
about right, only 55 percent of the students reported no problems with
being confused by the session. There was a significant grade effect on
this question. The sixth graders reported being less confused than the
fourth and fifth graders. There was also a grade effect on the last
question. Sixth graders were significantly higher than fifth graders,
and fifth graders were significantly higher than fourth graders, on their
perceptions of the helpfulness of knowing their strong and weak areas.
Across all grades the students were very positive about the effects
on learning, more so, in fact, than the teachers.

²One-way analysis of variance was conducted to investigate the effects
of grade level on these items. Significant effects were further
tested with Duncan's Multiple Range Test. Item responses were
coded such that Yes = 3, Not sure = 2, and No = 1.

CONCLUSIONS FOR THE IMPACT STUDY

The interpretation process had no immediate impact on either
teacher attitudes/opinions or on teacher knowledge about the test, as
assessed for the study. However, the evaluations of the interpretation
sessions by teachers indicate that the sessions were positively received,
were thought by most teachers to have the potential for positive effects
in both future testing and teaching situations, and were considered by
almost all teachers (94 percent) to be worth continuing next year. For
students there was evidence of positive impact of the interpretation
session. One important finding was the significant difference in
knowledge found between students who had and those who had not been
through the interpretation session. Cormany's (1974) study of attitudes
toward standardized testing concluded that persons who felt they were
well informed about the subject had more positive attitudes about it.
If the increase in knowledge about the test, generated through the
interpretation session, leads to feelings of being well informed (or
better informed) about the test, then general attitudes toward the
test may be improved over the long run.

The emphasis of this discussion of attitude change resulting from
greater knowledge, however, must be on the long-term potential effects,
since no group differences were found for the short-term effects of the
interpretation session. The experimental and control groups did not
differ in their ratings of several dimensions of the test, perceived
utility of test results, or how nervous they felt before entering the
testing situation.

The one opinion item which appeared to be directly affected by
the interpretation session concerned how well the students thought
they had done on the Iowa Tests of Basic Skills. Those students who
participated in the interpretation process felt they had done better
on the test than those who did not participate (see Figure 4). These
results are particularly relevant in view of the frequent criticism
that standardized tests may damage students' self-image. Figure 4
shows that student self-ratings, in general, clustered around the average
rating with some skewness on the above average side. If the criticism
were valid, the distributions would be skewed in the opposite direction.
The findings further suggest that if test results are not interpreted
with active student participation, students tend to rate themselves
lower on the test and have a lower self-image of their abilities to
achieve in school.

In summary, the interpretation process had an immediate impact on
student knowledge about the Iowa Tests of Basic Skills and on the
students' views about how well they had performed on the test. Other
attitudes and opinions about the test were not immediately affected
by the interpretation process. Further research on the long-term effects
on attitudes is needed.

PART IV: TEACHERS' PREDICTIONS OF STUDENT PERFORMANCE 
ON SUBSKILLS OF MATHEMATICS 

A rarely asked, but critical, question for educators concerned
about testing and the use of test results has been succinctly posed
by Fitzgerald (1980): "Do tests provide much information about
children's performance that teachers don't know by classroom
observation?" (p. 216). If current practice is to be used as a guide in
answering this question, then it might be said that teachers believe
classroom observations are overwhelmingly the most useful of the
ways of assessing students. Salmon-Cox (1981) recorded the finding
that teachers most heavily depend on observation, perhaps bringing
into serious question Ebel's (1972, p. 49) assertion that "the
majority of teachers and professors are keenly aware of the limited
and unsatisfactory bases they ordinarily have for judging the
relative achievement of various students and of the fallibility of
their subjective judgments when based on the irregular, uncontrolled
observations they can make in their classroom or office."

This study was an attempt to address one aspect of Fitzgerald's
(1980) question. Teachers were asked to predict how their students
would perform on the Mathematics subtests of the Iowa Tests of Basic
Skills. The predictions, which were made both in terms of
criterion-referenced (raw-score) performance and norm-referenced
(percentile-rank) performance, were later compared to the actual
performance obtained on the tests by the students. The basic questions
asked in this study were:

1. How highly correlated are the predicted and observed
   scores of students for subskill scores and for subtest
   total scores?

2. Do teachers tend to be accurate in their predictions
   of student obtained scores, and if over- or under-
   predictions occur, are they systematic?

3. Are there systematic differences in predictions that
   appear to be related to either grade, student sex, or
   overall mathematics achievement?

4. What relationships exist between predictions made on
   the basis of raw-score versus norm-referenced estimates?

METHODS 

Participants 

Forty-three fourth-, fifth-, and sixth-grade teachers, and a
random sample of 374 of their students, participated in this study.
An average of between eight and nine students per teacher was selected,
with a maximum set at ten students per teacher per class, in order to
keep the number of predictions an individual would have to make within
reasonable limits. The final student sample included 140 fourth graders,
117 fifth graders, and 117 sixth graders. In each grade 55 percent of
the students were males and 45 percent were females.

Of the teachers participating, 21 were fourth-grade, 14 were fifth-
grade, and eight were sixth-grade teachers. Some of the sixth-grade
teachers were mathematics teachers in a middle-school setting and, thus,
made predictions for students from more than one class. The remaining
teachers were responsible for the full range of teaching in self-contained
classrooms. The participating teachers were drawn from six medium-sized
to small school districts in eastern Iowa. Five public school districts
and one private school participated.

Data Collection

Early in the fall semester of the 1980-81 school year, school
districts and teachers were solicited for participation in the study.
The students for whom predictions were to be made were selected
randomly from class lists, and a Skill Summary Sheet for the appropriate
form and level of the test was prepared for each sampled student (see
page 6 of this report for an example of the Skill Summary Sheet format).
One to two weeks prior to the administration of the Iowa Tests of Basic
Skills, the teacher was sent the Skill Summary Sheets for the sample
of students from his/her class, along with directions for completing them.
The directions instructed the teacher to "...circle the score you think
the student named on the form will receive for each skill area listed."
The three divisions of the Skill Summary Sheet were briefly explained,
and an example was given with the directions. In addition, the teachers
were asked to estimate the percentile rank that the student would obtain
on each of the three mathematics subtests: Concepts, Problem Solving,
and Computation. This prediction was made on the scale shown in
Figure 5. The request for a norm-referenced prediction on the basis of
state, rather than national, norms was used because all of these school
districts use state norms generated through their participation in the
Iowa Testing Programs, and the teachers were accustomed to using state
norms.

Figure 5. Percentile Scale and Directions Used in Predicting Student
Performance on Mathematics Subtests.

    Please estimate the percent of students in this grade state-
    wide that this student is likely to score better than in
    Mathematics (subtest name) overall.

    Percent:  1  10  20  30  40  45  50  55  60  70  80  90  99
    (Circle the percent closest to your estimate.)

The predictions were made and returned to the project director before
testing occurred in the regular school-wide testing program. The
teachers had worked with the students a minimum of five weeks before the
predictions were made.

Following the testing of the students, their Pupil Item Response
Records were obtained from the schools and were used to calculate the
raw-score performance for each subskill tested for each student (see
page 5 of this report for an example of the Pupil Item Response Record
format). The actual percentile-rank performances of the students were
also obtained from their reports. These data, along with grade
equivalents and the various predictions made by the teachers, were coded
and keypunched for the data analysis.

Data Analysis 

The data analyses involved various comparisons between predicted
and observed scores on the three subtests under study and the subskill
categories comprising each subtest. The analyses included study of the
effects of norm-referenced versus raw-score frameworks in making the
estimates of performance, the accuracy of the predictions and over-
or under-prediction, and the effects of student sex, grade level, and
general mathematics achievement on the predictions.

The relationships between predicted and observed scores were
established through correlational analysis. Raw-score estimates of
performance and obtained raw scores were correlated, using Pearson
product-moment correlations, for each subskill on the three mathematics
subtests, and for total raw-score estimates for each subtest. The total
raw-score estimates were calculated by summing the teachers' predictions
on each subskill. In addition, correlations were computed for the
predicted and observed percentile ranks for each subtest.
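
The correlational step described above can be sketched as follows. This
is a minimal illustration only, not the project's original analysis
program; the function names (pearson_r, subtest_total_prediction) and
the example scores are hypothetical.

```python
import math


def subtest_total_prediction(subskill_predictions):
    """Sum a teacher's subskill raw-score predictions into a subtest total."""
    return sum(subskill_predictions)


def pearson_r(predicted, observed):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    # Sums of cross-products and squares of deviations from the means.
    s_po = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    s_oo = sum((o - mean_o) ** 2 for o in observed)
    return s_po / math.sqrt(s_pp * s_oo)


# Hypothetical example: predicted and observed subtest totals for five students.
predicted_totals = [subtest_total_prediction(s) for s in
                    [[3, 4, 2], [5, 5, 4], [2, 3, 1], [4, 4, 4], [1, 2, 2]]]
observed_totals = [10, 13, 7, 11, 6]
r = pearson_r(predicted_totals, observed_totals)
```

The same pearson_r function would apply unchanged to the predicted and
observed percentile ranks.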

The accuracy of the predictions was analyzed in two ways. First,
frequencies and percents were computed for discrepancy intervals,
using predicted percent correct minus observed percent correct and
ten-unit intervals. Thus, with perfect prediction equal to zero, the
interval indicating the greatest accuracy of prediction was the interval
-4.9 to 4.9. The percent of students (i.e., predictions) for each
subtest falling within the discrepancy intervals was found from the
frequency distributions. In addition to this descriptive approach,
t-tests of differences between means were calculated for the subtest
total scores and for the subskill scores. The t-tests for the raw-score
estimates for subtest scores were based on the difference in percents
between predicted and observed. For the norm-referenced calculations,
the predicted and observed percentile ranks were converted to z scores,
then handled in the same fashion as the raw-score estimates.
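
A minimal sketch of the discrepancy-interval tabulation and the
percentile-to-z conversion follows. It is illustrative only. The
function names are hypothetical, the handling of the outermost
intervals (35.0 or more in either direction) is an assumption, and the
use of the normal curve for the z conversion is also an assumption,
since the report does not state how the conversion was done.

```python
from collections import Counter
from statistics import NormalDist


def discrepancy_interval(predicted_pct, observed_pct):
    """Label the ten-unit interval holding predicted minus observed percent."""
    # Work in integer tenths of a percent to avoid floating-point edge cases.
    tenths = round((predicted_pct - observed_pct) * 10)
    magnitude = abs(tenths)
    if magnitude < 50:
        return "-4.9 to 4.9"          # interval containing perfect prediction
    band = (magnitude - 50) // 100    # 0 -> 5.0-14.9, 1 -> 15.0-24.9, ...
    if band >= 3:                     # assumed open-ended outer intervals
        return "-35.0 or less" if tenths < 0 else "35.0 or more"
    sign = "-" if tenths < 0 else ""
    lower = 5 + 10 * band
    return f"{sign}{lower}.0 to {sign}{lower + 9}.9"


def percent_within_intervals(pairs):
    """Percent of (predicted, observed) pairs falling in each interval."""
    counts = Counter(discrepancy_interval(p, o) for p, o in pairs)
    return {label: 100 * k / len(pairs) for label, k in counts.items()}


def percentile_rank_to_z(percentile_rank):
    """Convert a percentile rank (1-99) to a normal-curve z score (assumed)."""
    return NormalDist().inv_cdf(percentile_rank / 100)
```

A positive discrepancy is an over-prediction and a negative one an
under-prediction, matching the sign convention of predicted minus
observed used in the text.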

Correlations and t-tests were also computed for the subtest scores
in analyses of the effects of student sex, grade level, and
general mathematics achievement level. To establish the mathematics
achievement level, the grade-equivalent scores of the students were
summed across the three subtests (Concepts, Problem Solving, and
Computation), and quartiles were computed for this grand-total
mathematics score. The achievement groups were defined as: Low Achievers,
the bottom 25 percent of students on this total score; Average
Achievers, the middle 50 percent of students; and High Achievers, the
upper 25 percent of students. These cutoff levels for low, average,
and high were used because they are consistent with common practice
in test interpretation for defining above average, average, and below
average performance.
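
The grouping rule above can be sketched as follows. This is an
illustration under stated assumptions, not the original procedure: the
function name is hypothetical, the quartile cut points come from
Python's statistics.quantiles with its default method, and students
falling exactly on a cut point are placed in the Average group.

```python
from statistics import quantiles


def achievement_groups(grade_equivalents):
    """Assign Low / Average / High labels from summed grade equivalents.

    grade_equivalents: one (concepts, problem_solving, computation)
    tuple of grade-equivalent scores per student.
    """
    # Grand-total mathematics score: sum across the three subtests.
    totals = [sum(scores) for scores in grade_equivalents]
    # First and third quartile cut points of the grand-total score.
    q1, _median, q3 = quantiles(totals, n=4)
    groups = []
    for total in totals:
        if total < q1:
            groups.append("Low")       # bottom 25 percent
        elif total > q3:
            groups.append("High")      # upper 25 percent
        else:
            groups.append("Average")   # middle 50 percent
    return groups
```

With real data, ties at the cut points mean the three groups will only
approximate the 25/50/25 split.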

RESULTS AND DISCUSSION

Overall Correlations and Accuracy of Predictions

The correlations, based on estimates of raw-score performance for
the subtests and the subskills, are presented in Table 7. The
correlations for the subtests are all in the .50s and are somewhat higher
than the subskill correlations. This difference between subskill and
subtest correlations was anticipated because of the decrease in
reliabilities of both the test and the estimate when the shorter subskill
scores are estimated. These correlations, in general, indicate a moderate
amount of agreement in the rankings of the predictions and the actual
scores obtained by the students.

To further assess the agreement between predictions and obtained
scores, refer to Table 8. In Table 8, the percent of predictions about
each of a student's subtest scores, falling within specified ranges of
accuracy, is presented. Perfect prediction, which would be a discrepancy
score of 0, is contained in the interval 4.9 to -4.9. The interval is
essentially ten units wide, and represents the discrepancy between the
predicted percent correct and the observed percent correct. Thus, it
can be seen that 17.4 percent of the predicted Concepts scores were
within ±4.9 percent of being perfect predictions. It can further be
seen that an additional 20 percent of the predictions were between 5
and 14.9 percent too high and that 17.9 percent of the students'
Concepts scores were under-predicted by between 5 and 14.9 percent. It
can be seen from the data in Table 8 that the tendency was for teachers
to under-predict actual student performance in Concepts and to over-
predict performance in Computation.

Table 7. Correlations Between Predicted and Observed Scores by
Subtest and Subskill.

                              Norm-Referenced          Raw-Score
SUBTEST                       Predicted vs Observed    Predicted vs Observed
  Subskill                    n       r                n       r

CONCEPTS                      319     .55              374     .53
  Numeration                                           374     .38
  Equations                                            374     .41
  Whole Numbers                                        [illeg.] [illeg.]
  Fractions                                            373     .32
  Decimals/%                                           154     [illeg.]
  Geometry & Measurement                               373     .36

PROBLEMS                      319     .57              374     .52
  1-step, +, -                                         374     .34
  1-step, x, ÷                                         374     .43
  multiple-step                                        374     .45

COMPUTATION                   319     .59              374     .58
  Whole Number +                                       374     .39
  Whole Number -                                       374     .45
  Whole Number x                                       374     .54
  Whole Number ÷                                       372     .51
  Fractions +                                          234     .35
  Fractions -                                          198     [illeg.]
  Fractions x                                          117     .24
  Decimals +                                           117     .42
  Decimals -                                           117     .16*

*not significantly different from r = 0, p < .01.



Table 8. Percent of Students for Whom Raw-Score Predictions Were
Accurate Within Specified Discrepancy Intervals.

Prediction Discrepancy Intervals*     Percent of Students
(predicted % correct -                Within Discrepancy Ranges
 observed % correct)
                                      Concepts   Problems   Computation

   > 35.0                             3.7        5.6        5.1
   25.0 to 34.9                       4.0        7.5        7.2
   15.0 to 24.9                       9.6        11.2       14.2
   5.0 to 14.9                        20.0       12.8       21.7
   4.9 to -4.9                        17.4       25.4       24.1
   -5.0 to -14.9                      17.9       16.0       15.5
   -15.0 to -24.9                     13.6       11.8       6.7
   -25.0 to -34.9                     9.6        6.2        4.6
   < -35.0                            4.0        3.5        1.1

*perfect prediction = 0.



Table 9 carries the analysis of over- and under-prediction one step
farther. In Table 9, the mean raw-score percents correct for predicted
and observed are provided, along with t-tests of differences between
the means. For the subtests, the conclusion reached in the consideration
of Table 8 is supported, since a significant under-prediction was found
for the Concepts subtest and a significant over-prediction was found for
the Computation test.

It can also be seen in Table 9 that there is fairly consistent over-
prediction for the subskills of the Computation subtest. This over-
prediction might be attributable to a belief by teachers that the
fundamental skills of mathematics computation have been achieved to a
higher degree than they actually have. However, since the Mathematics
Computation subtest of the Iowa Tests of Basic Skills is a more speeded
test than any other part of the battery, it is likely that the
speededness of the subtest contributed to the over-predictions. When the
raw-score estimate is compared to the norm-referenced estimate, presented
in Table 10, it is found that the relative standing (percentile ranks)

Table 9. T-tests of Differences Between Mean Predicted and Observed
Raw Scores by Subtest and Subskill.

                          Mean       Mean                                            Significant*
SUBTEST                   Predicted  Observed   Mean                                 over (>) or under (<)
  Subskill                % Correct  % Correct  Diff.     SD       n        t        prediction

CONCEPTS                  54.2       56.4       -.023     .195     374      -2.25    <
  Numeration              61.5       60.7       .008      .246     373      .66
  Equations               [illeg.]   [illeg.]   [illeg.]  [illeg.] [illeg.] [illeg.] >
  Whole Numbers           58.7       58.5       .002      .266     374      .14
  Fractions               [illeg.]   [illeg.]   [illeg.]  [illeg.] [illeg.] [illeg.] <
  Decimals/%              44.3       39.2       .051      .399     154      1.58
  Geometry & Measurement  47.7       54.9       -.072     .265     373      -5.26    <

PROBLEM SOLVING           59.8       58.6       .012      .214     374      1.11
  1-step, +, -            74.9       [illeg.]   .073      .254     374      5.53     >
  1-step, x, ÷            55.7       52.5       .032      .301     374      2.03     >
  multiple-step           48.8       54.4       -.056     .248     374      -4.35    <

COMPUTATION               65.3       60.6       .047      .182     374      4.96     >
  Whole Number +          82.5       78.4       .042      .224     374      3.60     >
  Whole Number -          76.9       70.8       .061      .247     374      4.75     >
  Whole Number x          62.6       61.6       .009      .236     374      .77
  Whole Number ÷          51.8       50.9       .008      .292     372      .54
  Fractions +             42.4       28.3       .140      .338     234      6.34     >
  Fractions -             42.3       26.8       .159      .301     198      7.43     >
  Fractions x             49.3       32.2       .171      .388     117      4.76     >
  Decimals +              79.5       57.7       .218      .407     117      5.80     >
  Decimals -              48.3       50.8       -.026     .440     117      -.63

*p < .05.

of the students were more accurately predicted, thus providing support
to the hypothesis that speededness may have affected the raw-score
estimates.

The subskill predictions in Table 9 for the Concepts and Problem
Solving subtests, unlike those for the Computation subtest, are not
consistent in their directionality. However, a pattern of over- and
under-prediction can be distinguished. The subskills are presented
in a roughly hierarchical order in Concepts and Problem Solving, and it
can be seen that the pattern of over- or under-prediction followed an
ascending order of skill complexity. That is, the lower level skills
tended to be over-predicted and the more complex skills under-predicted.

Table 10 presents the t-test data for the norm-referenced
predictions, where the tabled mean difference is that between predicted
and observed percentile ranks that were converted to z scores. None of
the t-tests were significant for the norm-referenced framework. This
indicates that the teachers showed less tendency to over- or
under-estimate a student's relative performance (norm-referenced) than
was true with the raw-score (criterion-referenced) predictions.

Table 10. T-tests of Differences Between Mean Predicted and Observed
Percentile Ranks (converted to z scores) by Subtest.

                     Mean
Subtest              Diff.     SD       n       t        p

Concepts             .038      .863     319     .73      .43
Problem Solving      -.096     .897     319     -1.92    .06
Computation          .031      .851     319     .66      .51

Correlations and Accuracy of Predictions by Sex, Grade, and Achievement 
Level 

In an effort to determine which factors influenced the accuracy of
the predictions about student performance, analyses were conducted on
subgroupings based on student sex, grade level, and mathematics
achievement as previously defined. The results of these analyses are
presented here beginning with grade and sex, then turning to mathematics
achievement.

Table 11 presents correlations for predicted and observed scores
for both the raw-score and norm-referenced approaches by sex and
grade. These correlations are fairly consistent across grades, sexes,



Table 11. Correlations Between Predicted and Observed Scores, by Sex
and Grade.

                     Raw-Score Prediction                   Norm-Referenced Prediction
                     n     Concepts Problems Computation    n     Concepts Problems Computation

Grade 4    M         77    .58      .52      .59            63    .70      .72      .74
           F         63    .50      .48      .45            56    .59      .62      .55
           Total     140   .54      .50      .53            119   .65      .67      .65

Grade 5    M         64    .44      .41      .64            55    .41      .33      .50
           F         53    .48      .57      .58            49    .56      .64      .42
           Total     117   .46      .48      .62            104   .49      .48      .48

Grade 6    M         52    .58      .54      .67            43    .52      .63      .62
           F         65    .60      .62      .67            53    .54      .56      .66
           Total     117   .59      .59      .66            96    .52      .58      .65

Combined   M         193   .54      .48      .62            161   .58      .55      .64
Grades     F         181   .53      .55      .54            158   .52      .59      .53
           Total     374   .53      .52      .58            319   .55      .57      .59

subtests, and prediction approach (the two exceptions are the
norm-referenced correlations for fifth-grade Problems, boys versus girls,
z = -2.05, p < .05; and for the same subtest when compared to the
fourth-grade correlation, z = 2.12, p < .05*). This finding suggests
that across most of the predictions of student performance the
relationships between predicted and observed scores are similar.

*These statistics were computed using the formula:

$$ z = \frac{z_{r_1} - z_{r_2}}{\sqrt{\dfrac{1}{n_1 - 3} + \dfrac{1}{n_2 - 3}}} $$

where: z = test of significance for differences in two correlation
coefficients, z_r = a z-transformation of the correlation
coefficient, n = number of subjects in the sample.
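
The footnoted test for the difference between two independent
correlations can be sketched as follows. This is a minimal
illustration, assuming the z-transformation is Fisher's (the inverse
hyperbolic tangent of r); the function name is hypothetical.

```python
import math


def correlation_difference_z(r1, n1, r2, n2):
    """z test for the difference between two independent correlations."""
    # z-transformation of each correlation (Fisher's transformation, assumed).
    zr1 = math.atanh(r1)
    zr2 = math.atanh(r2)
    # Standard error of the difference between the transformed values.
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (zr1 - zr2) / standard_error


# Fifth-grade norm-referenced Problems correlations from Table 11:
# boys r = .33 (n = 55) versus girls r = .64 (n = 49).
z_boys_vs_girls = correlation_difference_z(0.33, 55, 0.64, 49)
```

With the rounded Table 11 values for fifth-grade boys and girls, this
gives z of about -2.05, in agreement with the value reported in the
text.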

In respect to the accuracy of predictions, however, fewer
similarities across grades and sexes are to be found. Table 12 provides
a summary of the patterns of over- and under-prediction of scores by
sex and grade. Among the items of general interest in Table 12 are
the consistent and significant over-predictions that occur for males
in the Computation subtest under the raw-score approach. These large
over-predictions for males and their small, but cumulatively significant,
counterparts for females are what led to the previous finding of general,
significant over-prediction for the Computation subtest. Also of
interest are the relatively accurate predictions for the Problem Solving
subtest, and the inconsistencies that appear between the raw-score
approach and the norm-referenced approach. While the number of
significantly inaccurate predictions is approximately the same for the
raw-score estimates (15) as for the norm-referenced estimates (16), in a
number of cases the direction of the inaccuracy changes between the
two methods; for example, grade 4 Computation or grade 5 Concepts.
These shifts in over- and under-prediction may be an artifact of the
different methods by which subtest scores were obtained (recall that
in the raw-score approach the subtest total score was computed as a
sum of the estimates for each subskill, whereas the norm-referenced
prediction was a single prediction for each subtest), or they may
reflect some real difference in a teacher's ability to predict using
the two different approaches. The current research design does not
address this question.
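The over- and under-prediction entries are flagged by a t-test of the mean difference between predicted and observed scores. A minimal sketch of that kind of check follows, assuming a paired (dependent-samples) t statistic, which the report does not specify; the function name `prediction_bias` and the scores are hypothetical and are not data from the study.

```python
import math
import statistics

def prediction_bias(predicted, observed):
    """Return (direction, t) for paired predicted and observed scores."""
    differences = [p - o for p, o in zip(predicted, observed)]
    mean_diff = statistics.mean(differences)
    sd_diff = statistics.stdev(differences)  # sample standard deviation
    t = mean_diff / (sd_diff / math.sqrt(len(differences)))
    if mean_diff > 0:
        direction = "over"
    elif mean_diff < 0:
        direction = "under"
    else:
        direction = "none"
    return direction, t

# Hypothetical scores for four students (not data from the study).
direction, t = prediction_bias([12, 15, 11, 14], [10, 13, 11, 12])
print(direction, round(t, 2))  # over 3.0
```

The resulting t would be compared with the critical value for n - 1 degrees of freedom; the sketch does not handle the degenerate case in which every difference is identical.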

Tables 13 and 14 present correlations and patterns of over- and
under-prediction, respectively, for predictions for three mathematics
achievement groupings, by grade. The correlations in Table 13 are
all lower than those previously seen, and some are negative. This
indicates that the modest precision that exists for the predictions when
all achievement groups are ranked together is reduced, and in some
cases lost altogether, when students are grouped along lines that are
consistent with common test interpretation practices of identifying
above average, average, and below average performance.

Table 12. Patterns of Over-Prediction and Under-Prediction by Sex and Grade, for Raw-Score
and Norm-Referenced Predictions.



                            Raw-Score                                  Norm-Referenced
  N      N      Concepts    Problem Solving  Computation    Concepts    Problem Solving  Computation
 (CR)   (NR)    Over Under  Over Under       Over Under     Over Under  Over Under       Over Under

Grade 4 




























♦ 


Male 

Female 

Total 


77 
63 
140 


63 
56 
119 




X* 
X 

X* 


X 


X 
X 


X* 
X 

X* 


- 




X* 
X* 
X* 




X* 
X* 
X* 




X 

X* ' 
X* 


Grade 5 


- 




























Male
Female
Total


64
53
117


55 
49 
104 




X* 
X 

X* 


X 
X 
X 




X* 
X 

) 




X* 
X* 

X*' 


1 

/ 

N 


X 
X 
X 




^ X* 
X 

X* 




Grade 6






























Male
Female
Total


52 
65 
117 


43 
53 
96 


X 
X 


X 


X 
X 
X 




X* 

J( 

X* 




X 

X* 
X* 






X 

^X 
X 


X 

^ X 


X 


Combined 
Grades 






• 
























Male 

Female 

Total


193 
181 
374 


161 
158 
319 




X* 

X 

X* 


X 
X 
X 




X* 
X* 
X* 

i , . . 




X 
X 
X 






X 
X 
X 


X 


X , 



*t-test of the mean difference between predicted and observed was significant, p < .05.

Table 13. Correlations Between Predicted and Observed Scores, by
Mathematics Ability Level and Grade.

                                     Ability Level
                       Low          (N)     Average    (N)      High         (N)

CRITERION-REFERENCED

Concepts
  Grade 4              .13          (36)     .17       (68)     .31          (36)
  Grade 5              .28          (31)     .04       (55)     .34          (31)
  Grade 6              .16          (32)    -.02       (55)    -.16          (30)
  Total                .19          (99)     .12      (178)     .27**        (97)

Problem Solving
  Grade 4              .33*         (36)     .07       (68)     [illegible]  (36)
  Grade 5              .24          (31)     .03       (55)     [illegible]  (31)
  Grade 6             -.22          (32)     .12       (55)     .29          (30)
  Total                .13          (99)     .07      (178)     .22*         (97)

Computation
  Grade 4              .12          (36)     .22       (68)     .12          (36)
  Grade 5              [illegible]**(31)     .35**     (55)     .12          (31)
  Grade 6              .00          (32)     .38**     (55)     .30          (30)
  Total                .11          (99)     .31**    (178)     .17          (97)

NORM-REFERENCED

Concepts
  Grade 4              .39*         (28)     .26*      (58)     .39*         (33)
  Grade 5             -.42*         (28)     .11       (50)     .21          (26)
  Grade 6              .16          (29)    -.23       (43)     .15          (24)
  Total                .29**        (85)     .04      (151)     .25*         (83)

Problem Solving
  Grade 4              .31          (28)     .40**     (58)     .26          (33)
  Grade 5              [illegible]  (28)     .05       (50)     .20          (26)
  Grade 6             -.20          (29)     .17       (43)     .15          (24)
  Total                .12          (85)     .18*     (151)     .23*         (83)

Computation
  Grade 4              .46**        (28)     .28*      (58)     .23          (33)
  Grade 5             -.02          (28)     .29*      (50)     .14          (26)
  Grade 6              .20          (29)     .20       (43)     .34          (24)
  Total                .20          (85)     .26**    (151)     .24*         (83)

 * r = 0, p < .05.
** r = 0, p < .01.

Table 14. Patterns of Over-Prediction and Under-Prediction by Mathematics Ability and Grade,
for Raw-Score and Norm-Referenced Predictions.



                          N       N
Ability Level            (CR)    (NR)

Grade 4
  Low                     36      28
  Average                 68      58
  High                    36      33

Grade 5
  Low                     31      28
  Average                 55      50
  High                    31      26

Grade 6
  Low                     32      29
  Average                 55      43
  High                    30      24

              Raw-Score                                  Norm-Referenced
  Concepts    Problem Solving  Computation    Concepts    Problem Solving  Computation
  Over Under  Over Under       Over Under     Over Under  Over Under       Over Under



X 

X* 



. . 6 

; X 

' X 

1 X* 



X 

X* 



X* 
X* 



X 

X* 



X* 
X* 



X 

X* 



X* 
X 



X 
X 



X* 
X* 
X 



X* 
X* 



X* 

X 



X* 
X* 



! X 



X* 
X* 



X* 
X* 



X* 
X 



*t-test of mean difference between predicted and observed was significant, p < .05.



Table 14 provides some fairly clear indications of the problems
teachers have in estimating the performance of students at various
achievement levels. Among the significantly inaccurate predictions
shown in Table 14, 100 percent of those for low ability students, for
both the raw-score and norm-referenced approaches, were over-predictions.
Notably large percentages for the average ability students were also
over-predictions (75 percent in the norm-referenced approach, and
100 percent in the raw-score approach). However, the direction of
the inaccuracy shifted to under-prediction for the high ability students
(in the norm-referenced approach, 100 percent of the significant
predictions were under-predictions and in the raw-score approach 80 percent
of the significant predictions were under-predictions). These findings
suggest that for low and average achieving students, teachers tend to
be overly optimistic about their students' subsequent performance on
the test, and for high achieving students the teachers tend to be
overly pessimistic.

SUMMARY OF THE STUDY OF TEACHERS' PREDICTIONS OF
STUDENT PERFORMANCE ON SUBSKILLS OF MATHEMATICS

Fitzgerald (1980) asked whether tests provide information to
teachers that they do not already know through classroom observations,
and Salmon-Cox (1981) found that teachers' most frequently mentioned
method of assessing their students was "observation." On the other
hand, Ebel (1979) has contended that, "Most assessments of student
achievement currently being made in our schools and colleges are
... highly subjective, uniquely individualistic, and unsystematic"
(p. 11). Some testing programs in their interpretation materials
emphasize that the "test data should confirm what a sensitive teacher
already knows about students" (Prescott et al., p. 44); others,
while recognizing that test results do not replace teacher judgment,
appear to focus more on the discrepancies between teacher judgment and
observed performance. The Iowa Tests of Basic Skills falls within the
latter category.

Changes in education over time have led to increasingly formal
settings, and teachers are now expected to deal with increasing numbers
of students in their daily teaching activities (Chauncey & Dobbin, 1963).
This fact makes it increasingly difficult for teachers to meet Ahmann
and Glock's (1975) challenge to know the student well enough to design
appropriate educational programs to meet the objectives of instruction.

This study focused on the discrepancies between teachers' predic- 
tions of student achievement in mathematics and the subsequent actual 
achievement of the students involved. Teachers predicted both the 
raw-score performance and the relative standing (percentile-ranks) of 
randomly selected students in their classes. The predictions were 
based on the Iowa Tests of Basic Skills, which were in use in each of 
the classes. 

In general, it was found that teachers were not very accurate
predictors of student performance on the test. Further, it was found
that some systematic biases appear to exist in the predictions. Males
were frequently over-predicted, and the predictions for low and
average ability students were also generally too optimistic. On the
other hand, high ability students were generally under-predicted.

To the extent that subjective judgments of mathematics achievement
are "better than" mathematics achievement as measured by the Iowa
Tests of Basic Skills, the results of this study are diminished.
However, the results of the study do indicate that in the absence of
test results, many questions about individual student achievement in
mathematics might never surface for a teacher. It is also indicated
that biases which favor males, low ability, and average ability students
could be brought into question through the appropriate interpretation
of test results.

In respect to Fitzgerald's (1980) question concerning the additional
information gained from tests over classroom observation, it
appears that at least somewhat different information is often obtained
from the two sources. Whether one source is better than the other is
probably not a resolvable question. However, the fact that many
discrepancies between teacher expectancies and test performance exist
can lead to individuals, and perhaps certain subgroups of students,
getting a closer look, as the teacher tries to explain the discrepancies
for him- or herself. Ultimately, the questioning may lead to a more
appropriate and more effective teaching program for the student, and
the test will have served a valuable purpose.

PART V: RELATIONSHIPS BETWEEN THE RESULTS OF THE
IOWA TESTS OF BASIC SKILLS, MATHEMATICS SUBTESTS,
AND THE STANFORD DIAGNOSTIC MATHEMATICS TEST

One concern in validating a diagnostic interpretation technique
for a test such as the Iowa Tests of Basic Skills (Hieronymus et al.,
1978) is whether the survey test provides information that is similar
to that obtained from recognized diagnostic tests. For this study, the
Stanford Diagnostic Mathematics Test (Beatty, Madden, Gardner, &
Karlsen, 1976) was used to compare student performances on the survey
test and a diagnostic test.

It has been claimed that the results of the Iowa Tests of Basic
Skills are "not useful for making decisions at the level of the
individual child" (Harris, 1978, p. 57). On the other hand, the claim
has been made for the Stanford Diagnostic Mathematics Test, that it
"is an adequate test to identify the strengths and weaknesses of
individual pupils in the areas covered" (Lappan, 1978, p. 437). This
study was designed to provide both a structural and statistical
assessment of the similarities between the Iowa Tests of Basic Skills
subtests in Mathematics Concepts, Problem Solving, and Computation, and
their counterparts of Concepts, Applications, and Computation on the
Stanford Diagnostic Mathematics Test.

METHODS 

Structural Comparisons of the Iowa Tests of Basic Skills and Stanford 
Diagnostic Mathematics Test 

In preparation for the analyses of data collected under procedures
described below, a structural analysis of the two test batteries was
undertaken. This analysis involved a comparison between the tests to
identify commonalities in the subskills tested, where the items
assessing the subskills appeared in the respective test batteries, and
to identify subskill categories that were tested on one of the
batteries, but not on the other.

Item evaluators. Three item evaluators were assigned the task of
reconciling the skill classifications on the two test batteries. These
evaluators were: 1) a former test editor and testing consultant;
2) a graduate research assistant; and 3) the author of the mathematics
tests of the Iowa Tests of Basic Skills. Each of these evaluators
was familiar with the classification scheme used on the Iowa Tests, and
each had previously reviewed the skills classification on the Stanford
Diagnostic Mathematics Test.

The reconciliation process. Each of the three item evaluators
independently reclassified each item on the Stanford Diagnostic
Mathematics Test into its corresponding Iowa Tests of Basic Skills
skill classification. The three independent classifications were then
compared and discrepancies were reconciled, with the author making
the final determination where disagreement existed. The result of this
process was the reclassification and relabeling of the items of the
Stanford Diagnostic Mathematics Test so that direct comparisons
between the two test batteries could be made.

Comparisons of Performance on the Two Test Batteries

Subjects. The students tested for this study were 288 fifth- and 260
sixth-grade students in a medium-sized (3,316 students, K-12) school
district in eastern Iowa. The fifth-grade group was approximately
53.5 percent male and 46.5 percent female, and the sixth-grade group
was approximately 46.5 percent male and 53.5 percent female.

Testing procedures. Both test batteries were administered during
the fall semester of the 1980-81 school year. All students were
administered the Iowa Tests of Basic Skills in the regular district-
wide testing program in mid-September, 1980. Approximately two weeks
later, participating students were administered the Stanford Diagnostic
Mathematics Test. The tests were given in classroom groups, according
to the directions specified in the manuals for the respective tests and
under typical testing conditions. The tests were scored through the
regular scoring services provided by the Data Score Systems of
Westinghouse Learning Corporation for both tests. Data tapes were
obtained and merged for matched students for whom complete item
data were available on both tests.

Data analysis. For each subtest in its originally published form
and for the reclassified subtests, defined through the structural
analysis described above, classical test statistics were computed.
These included item p-values and discrimination indices, test means,
standard deviations, and reliabilities (KR-20).
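Two of the statistics named above can be sketched as follows. This is an illustration under stated assumptions rather than the scoring service's procedure: the function names are introduced here, items are assumed to be scored 0/1, the total-score variance is taken as the population variance (the report does not say which form was used), and the response matrix is hypothetical.

```python
import statistics

def item_p_values(responses):
    """Proportion of students answering each item correctly."""
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_students
            for j in range(n_items)]

def kr20(responses):
    """Kuder-Richardson formula 20 for 0/1 item scores."""
    k = len(responses[0])
    p = item_p_values(responses)
    sum_pq = sum(p_j * (1 - p_j) for p_j in p)
    total_variance = statistics.pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_pq / total_variance)

# Hypothetical responses: 4 students (rows) by 3 items (columns).
responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(item_p_values(responses))  # [0.75, 0.5, 0.25]
print(kr20(responses))           # 0.75
```

The report's tables express p-values as percentages, so the proportions above would be multiplied by 100 for comparison.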

Using the test statistics generated and the student raw-scores,
intercorrelations were computed between the published Concepts subtests,
the Problem-Solving and Applications subtests, and Computation subtests.
In addition, reliabilities of differences were computed, using the
formula:

$$ r_{dd} = \frac{r_{xx} S_x^2 + r_{yy} S_y^2 - 2 r_{xy} S_x S_y}{S_x^2 + S_y^2 - 2 r_{xy} S_x S_y} $$

Where:

$r_{dd}$ is the reliability of the difference,
$r_{xy}$ is the intercorrelation between the tests,
$r_{xx}$, $r_{yy}$ are the respective reliabilities of the tests,
$S_x^2$, $S_y^2$ are the respective variances of the tests.

(Stanley, 1971, p. 385).
The same procedures for data analysis were repeated for the restructured
tests, which included the three subtests (in restructured form) from
the first analysis, plus a new subtest, called Graphs and Tables, which
was defined in the structural analysis of the tests.
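The reliability-of-difference formula cited from Stanley (1971) translates directly into code. The sketch below is illustrative only: the function name is introduced here, and because the report does not list the subtest standard deviations, both are set to a hypothetical value of 1.0, so the result is not expected to match any value in the report's tables.

```python
def reliability_of_difference(r_xx, r_yy, r_xy, s_x, s_y):
    """Reliability of the difference between scores on tests X and Y."""
    covariance_term = 2 * r_xy * s_x * s_y
    numerator = r_xx * s_x**2 + r_yy * s_y**2 - covariance_term
    denominator = s_x**2 + s_y**2 - covariance_term
    return numerator / denominator

# Reliabilities .82 and .85, intercorrelation .674; the standard
# deviations are hypothetical (assumed equal).
r_dd = reliability_of_difference(0.82, 0.85, 0.674, s_x=1.0, s_y=1.0)
print(round(r_dd, 3))  # 0.494
```

With equal variances the expression reduces to the mean of the two reliabilities minus the intercorrelation, divided by one minus the intercorrelation, which shows why a high intercorrelation pulls the reliability of the difference down.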

RESULTS AND DISCUSSION 
Structural Analysis and Item Reclassification 

Both the Stanford Diagnostic Mathematics Test and the Iowa Tests 
of Basic Skills have subtests in Mathematics Concepts, Applications or 
Problem-Solving, and Computation. However, major differences were 
observed in the item types and contents found under the various subtest 
labels. 

It was found that nine items appearing in each Applications subtest 
of the Stanford Test were items involving the reading of graphs and 
tables. These items were similar to a subset of items appearing in the 
Work-Study Skills, Visual Materials subtest of the Iowa Tests. 
Therefore, for both batteries, a new subtest was defined and called
Graphs and Tables. These new subtests were included in the data analysis
for the reclassified tests.

Aside from the inclusion of items from the Work-Study Skills
portion of the Iowa Tests, the three Iowa Mathematics subtests were
held intact for all of the analyses. The reclassification of the
Stanford Test items is summarized in Table 15. The table shows, for
example, that of the 36 grade 5 items in the Concepts subtest, all
were reclassified into Concepts skill categories by the item
evaluators. However, an additional ten items, originally published as
Applications items, were reclassified into Concepts items, and nine
items originally appearing in the Stanford Computation subtest were
reclassified as Concepts items. The result of casting the Grade 5
Stanford items into Iowa Tests of Basic Skills classifications was a
newly defined Concepts subtest of 55 items. This compares to a
published version of the Stanford Concepts subtest of 36 items.

The general pattern of the reclassifying of the Stanford items
yielded a considerably heavier emphasis on mathematics Concepts than
might be assumed by looking at the subtest titles. Another striking
outcome of the reclassification was the drop in emphasis shown for
measuring skills in solving story problems. Even if the story
problems and Graphs and Tables were combined, the resulting Applications
subtest would be shorter by more than a third, because of the Concepts
items imbedded in it. Limitations in the Stanford Diagnostic
Mathematics Test item classification schemes have been previously noted
(Sowder, 1978), but these limitations have not been shown so
dramatically as they appear here.

Only three items, of the 231 making up the two levels of the
Stanford Diagnostic Mathematics Test, were of a type that had no
counterpart on the Iowa Tests of Basic Skills. Therefore, although
there appears to be some difference in the emphasis of the various
mathematics skills tested between the two tests (as measured by the
number of items allocated to different skill categories), the two
test batteries do measure similar skills. In general, the Stanford
samples a somewhat narrower content domain, but includes a greater
number of items for each of those skills represented. These are the
kinds of differences one would expect between a "diagnostic" and
a "survey" battery with approximately the same total number of items
in each battery. The differences in emphasis in specific subskills
are illustrated in Table 16.
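The item counts reported in Table 15 can be checked with a short tally. The sketch below is only a bookkeeping aid: the dictionary layout and the function name `reclassified_totals` are introduced here, while the counts themselves are those given in Table 15.

```python
# SDMT item counts by published subtest (outer key) and by the ITBS
# classification the evaluators assigned (inner key), from Table 15.
reclassified = {
    "Grade 5": {
        "Concepts": {"Concepts": 36},
        "Applications": {"Concepts": 10, "Applications": 11,
                         "Graphs and Tables": 9},
        "Computation": {"Concepts": 9, "Computation": 39},
    },
    "Grade 6": {
        "Concepts": {"Concepts": 36},
        "Applications": {"Concepts": 12, "Applications": 9,
                         "Graphs and Tables": 9},
        "Computation": {"Concepts": 6, "Computation": 42},
    },
}

def reclassified_totals(grade):
    """Sum the items assigned to each ITBS classification for one grade."""
    totals = {}
    for assigned in reclassified[grade].values():
        for classification, count in assigned.items():
            totals[classification] = totals.get(classification, 0) + count
    return totals

for grade in reclassified:
    totals = reclassified_totals(grade)
    print(grade, totals, sum(totals.values()))
```

The tally gives 55 and 54 Concepts items for grades 5 and 6, and 114 classified items at each level; those 228 items plus the three without an Iowa counterpart account for the 231 items mentioned in the text.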



Table 15. Number of Stanford Diagnostic Mathematics Test Items Reclassified to Comparable Iowa
Tests of Basic Skills Classifications.

                                                                               Total # of Items
                                                                               in the
                                 Concepts       Applications    Computation    Reclassified Tests
                               Gr. 5   Gr. 6   Gr. 5   Gr. 6   Gr. 5   Gr. 6   Gr. 5   Gr. 6

# of items in the published
tests                            36      36      30      33*     48      48

# of items in the reclassified
tests
  Concepts                       36      36      10      12       9       6      55      54
  Applications                    0       0      11       9       0       0      11       9
  Computation                     0       0       0       0      39      42      39      42
  Graphs and Tables               0       0       9       9       0       0       9       9

*No equivalent items or item classifications appeared in the Iowa Tests of Basic Skills for
three of the items in the Stanford Applications subtest.

Table 16. Comparison of Items Allocated to Specific Subskills on the
Iowa Tests of Basic Skills and the Stanford Diagnostic
Mathematics Test for Grades 5 and 6.



Subtest: Mathematics Concepts 


ITBS
Grade 6


SDMT
Grade 6


ITBS
Grade 5


SDMT
Grade 5


Numeration, number systems, and sets


9 


14 


9 


20 


Counting and number series




3 


- 


4 


Place value and expanded notation 


3 


6 


3 


5 


Properties of number systems 


3 


4 


4 




Subsets of number systems 


1 


1 


1 




Sets 


2 




1 




Equations, inequalities, and number










sentences 


4 


6 


5 


9 


Operational and relational 










symbols 


1 




2 




Solution of number sentences 


3 


6 


3 . 


y 


Whole numbers; Integers 


6 


13 


/ 




Reading and writing 


1 


3 


• 


3 


Relative values 




2 




Z 


Terms 


2 


2 


2 




Fundamental operations: Number










facts 


1 




2 


2 


Fundamental operations: Ways to 










perform 


1 


4 


1 


3 


Fundamental operations: Estima- 










ting results and rounding 


1 


2 


2 


3 


Fractions 


8 


6 


7 


3 


Part of a whole and partitioning

of a set
Relative values
Equivalent fractions
Terms

Fundamental operations: Ways to
perform

Fundamental operations: Estima- 
ting results 

Ratio and proportion 


1 

1 

3 
1 

1 
1 


2 

o 
d 

1 
1 

« 


2 
i 
1 
1 

1 

1 


2 

i 
« 

} 

1
Table 16. Continued.



Subtest: Mathematics Concepts


ITBS
Grade 6


SDMT
Grade 6


ITBS 
Grade 5 


SDMT 
Grade 5 


Decimals, currency, and percent


5 


3 




2 


Reading and writing
Relative values

Fundamental operations: Estima-
ting results and rounding

Equivalence: Decimals, fractions, 
and percents 

Probability and statistics 




2 
1 


• 


2 

• 


Geometry and measurement 

Measurement: Quantity, time, 

and temperature 
Measurement: Length and weight 
Recognizing types and parts of 

geometric figures 
Area and perimeter of plane 

figures 

Use of geometric figures in 
description and proof 


8 




9 


7 


1 
3 

1 

2 

1 


• 


2 
3 

2 

1 

1 


i 
3 

3 

• 


Subtest: Mathematics Problem 
Solving 


ITBS 
Grade 6 


SDMT 
Grade 6 


ITBS 
Grade 5 


SDMT 
Grade 5 


Single-step problems: Addition - 
Subtraction 

Currency 
Whole numbers 

Fractions, decimals, percents 


9 


1 


11 


3 


1 

1 7 

i 1 


1 

1 
t 

i 


3 
6 
2 


1 

2 


Single-step problems: Multiplica- 
tion - Division 

Currency 
Whole numbers 

Fractions, decimals, percents


1 7 i 2 


6 


4 


2 1 • 
4 1 2 

1 i 

1 


I 


4 

• 

1 
1 



Table 16. Continued.



Subtest: Mathematics Problem 
Solving 


ITBS
Grade 6


SDMT
Grade 6



ITBS 
Grade 5 


SDMT 
Grade 5 


Multiple-step problems: Combined
use of basic operations 

Currency 
Whole numbers 

Fractions, decimals, percents 


13 


6 


10 


4 


8 
5 


3 
3 


6 
4 


4 
• 


Subtest: Mathematics Computation 


ITBS 
Grade 6 


SDMT 
Grade 6 


ITBS 
Grade 5 


SDMT 
Grade 5 


Whole number 

Addition
Subtraction
Multiplication
Division 


28 


■ 33 


38 


39 


5 
9 
9 


Q 
0 

6 

12 
12 


1 A 

9 

12 
7 


c. 
u 

12 
12 
9 


Fractions 

Addition
Subtraction
Multiplication
Division 


13 


3 


7 




6 
3 

■ 


2 
1 


6 
4 


• 


Decimals

Addition 
Subtraction 
Multiplication 
Division 


4 


6 






2 
2 

• ■ 


2 
1 
3 


• 
• 


• 


Subtest: Graphs and Tables 

« 


ITBS 
Grade 6 


SDMT 
Grade 6 


ITBS 
Grade 5 


SDMT 
Grade 5 


Reading amounts 

Using the scales on bar and 
line graphs 


1 


3 , 


2 


1 3 


1 


3 

i 


2 


I 

3
Table 16. Continued.



Subtest: Graphs and Tables


ITBS
Grade 6


SDMT
Grade 6


ITBS
Grade 5


SDMT
Grade 5


Comparing quantities


4 


» 

6 


8 


6 


Determining rank 

Determining differences between 

amounts 
Determining ratios 


• 4' 


2 
4 


. 3 


3 



\ 

Another way of determining whether the two batteries measure
mathematics skills in comparable ways is through statistical analysis
of the performance of students. Such an analysis is discussed in the
following section.

Comparisons of Performance on the Iowa Tests of Basic Skills and the
Stanford Diagnostic Mathematics Test

The analysis of performance involved two stages, one for the tests
as they were published, and a parallel investigation for the tests as
they were reclassified through the structural analysis. In each case,
item p-values were determined and discrimination indices were
computed. Table 17 presents the mean p-values and mean item-total
correlations for the various subtests. It can be seen from the table
that the Stanford Diagnostic Mathematics Test is a considerably easier
test than the Iowa Tests of Basic Skills. This finding is not
surprising, since the Stanford Test has been specifically designed to
discriminate well among the lower achieving students on the skills
tested. The Stanford also shows somewhat higher mean biserial
discrimination indices than the Iowa Tests. This difference in
discrimination values is most likely caused by the greater content
homogeneity of the Stanford tests.

The reliabilities of the various subtests are presented in Table 18,
along with the intercorrelations between similar subtests on
the Stanford and the Iowa batteries. In addition, estimates of the
reliability of differences between subtest scores are given.

Table 17. Mean p-values and Biserial Correlations for the Published and Reclassified Tests.

                                 ITBS              SDMT (Published)     SDMT (Reclassified)
Subtest                      p     mean r_bis      p     mean r_bis      p     mean r_bis

Grade 5
  Concepts                 59.27     .487        78.64     .589        77.49     .616
  Problems/Applications    62.44     .574        73.53     .623        67.64     .625
  Computation              60.64     .548        82.58     .634        83.51     .600
  Graphs and Tables*       57.60     .417                              82.00     .679

Grade 6
  Concepts                 57.12     .559        55.58     .599        59.18     .600
  Problems/Applications    59.86     .602        70.58     .635        71.33     .676
  Computation              60.91     .612        73.38     .654        72.88     .646
  Graphs and Tables*       41.20     .420                              80.89     .713

*The Graphs and Tables subtest, for both ITBS and SDMT, is a created subtest, with items drawn
from the Work-Study Skills, Visual Materials Test of ITBS and from the Applications Test of
the SDMT.


Table 18. Number of Items and KR-20s by Test, and Intercorrelations and Reliabilities of
Difference for Like Tests for Each Grade Level and for the Published and the
Reclassified Forms of the Tests.

                                                Concepts       Problems       Computation    Graphs and Tables
                             Grade     N      ITBS   SDMT    ITBS   SDMT     ITBS   SDMT      ITBS   SDMT

PUBLISHED TESTS

No. of Items (k)               5      288      37     36      27     30       45     48
KR-20                                         .82    .85     .84    .86      .88    .89
Intercorrelation                                 .674           .740            .629
Reliability of Difference                        .525           .403            .701

No. of Items (k)               6      260      40     36      29     33       45     48
KR-20                                         .88    .88     .87  [illegible] .90    .92
Intercorrelation                                 .844           .771            .770
Reliability of Difference                        .251           .449            .614

RECLASSIFIED TESTS

No. of Items (k)               5      288      37     55      27     11       45     39        10      9
KR-20                                         .82    .90     .84    .72      .88  [illegible] .44    .70
Intercorrelation                                 .747           .720         [illegible]     [illegible]
Reliability of Difference                        .434           .213            .663            .321

No. of Items (k)               6      260      40     54      29      9       45     42         5      9
KR-20                                         .88    .92     .87    .74      .90    .91       .39    .74
Intercorrelation                                 .858           .683            .756            .345
Reliability of Difference                        .299           .394            .610            .333

The subtest reliabilities are generally respectable for both test
batteries, ranging between .70 and .92, except for the Graphs and
Tables subtests created for the Iowa Tests of Basic Skills. Further,
for the longer subtests overall, the reliabilities range above .80.

With the small numbers of items (ranging from 5 to 10) in the
Graphs and Tables subtests, relatively low reliabilities were found.
Since the low reliabilities associated with the subtests restricted the
intercorrelations between the tests and the restricted intercorrelations
in turn inflated the reliabilities of differences for these tests, the
results are presented, but not discussed.

The interpretation of the reliabilities of differences should help
in determining whether the two test batteries are functioning
differently in a statistical sense. In one respect, the intercorrelations
presented in Table 18 could be viewed as concurrent validity
coefficients, that is, as estimations of the degree to which the two test
batteries measure the same attributes. From that perspective, the
correlations are reasonably high. However, this raw correspondence
and its interpretation can be enhanced by study of the reliabilities of
difference. In this case, the reliabilities of differences represent a
measure of the stability of the difference scores observed between,
for example, the two Concepts subtests. The higher the reliability
of the difference, the more stable that difference is assumed to be.
In other words, "real" differences, rather than differences attributable
to error, are associated with high reliabilities of differences. When
high intercorrelations and low reliabilities of differences exist, the
subtests can be said to not be measuring statistically unique attributes.

The interpretation of reliabilities of differences can be
approached in the same way a test user would approach interpreting the
reliability of a test (Schreiner, Hieronymus, & Forsyth, 1969).
However, it should be clear that high reliabilities of difference can
be obtained only when two highly reliable measures with low
intercorrelations have been used (Stanley, 1971).

The reliabilities of differences presented in Table 18 do not
provide a definitive answer to the question whether the Iowa Tests and
the Stanford Test measure the same attributes. In fact, some of the
results are indeed surprising.






First among the surprises contained in Table 18 is the fact that 
for the published Concepts subtests, the reliability of differences 
was markedly different for the fifth- and sixth-grade groups. This 
finding indicates that the similarities in the Concepts subtests are 
greater at the sixth-grade level than at the fifth. This difference 
was somewhat ^•educed under the analysis of the reclassified tests and, 
perhaps, supports the earlier contention that reclassification of 
Stanford items led to structuring reasonably comparable tests, in 
terms of skill coverage. The same phenomenon, of reduced difference 
between fifth- and sixth-grade results appeared in the Computation 
subtest, and may further support the preceding statement. In general, 
the reliabilities of differences among the published versions of the 
Problems and Computation subtests were comparable between the two 
grades. 

The second surprise was the high reliabilities of differences for 
the Computation subtests, relative to either the Concepts or Problems 
subtests. This finding suggests that the greatest likelihood of the 
Iowa and Stanford tests measuring different attributes is found in the 
Computation subtests. While this finding is counter-intuitive, it may 
be explained in part by the speededness of the two tests: the Iowa 
Computation subtest is relatively speeded, whereas the Stanford 
Computation subtest is essentially a power test. The difference may 
therefore not be one of computation skills measured, but one of speed 
and accuracy versus accuracy alone. 

Third among the surprises was that the differences between the 
published forms and the reclassified forms of the tests were, in 
general, relatively small. The restructuring of the Stanford Tests 
according to the Iowa's skills classification scheme did have substan- 
tial impact on the Problems (Applications) subtest. However, at 
least a portion of that impact can be attributed to the small number 
of "story" problems left after the reclassification, and the consequent 
lowering of the reliabilities of the subtests. 

It should be noted that all of the reliabilities of differences 
are inflated to an unknown degree. The intercorrelations between the 
Stanford tests and Iowa tests include day-to-day sources of variation 
in pupil performance, while the Kuder-Richardson 20 reliability 
coefficients for these tests do not. The use of more appropriate 
parallel forms reliabilities in estimating the reliabilities of 
differences would have lowered the obtained values substantially. 
The effect of taking into consideration different sources of error 
in computing such reliabilities is documented in the Manual for 
Administrators, Supervisors, and Counselors of the Iowa Tests of 
Basic Skills (Hieronymus & Lindquist, 1974, pp. 71-73). 
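
The inflation described above can be illustrated with a short sketch. It assumes the standard formula for the reliability of a difference between two standardized scores with equal variances; the report does not print the formula it used, and every coefficient below is hypothetical rather than a value from Table 18.

```python
def reliability_of_difference(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Reliability of the difference between two standardized scores.

    r_xx, r_yy: reliabilities of the two tests.
    r_xy: intercorrelation between the two tests.
    Assumes equal score variances (the usual textbook form).
    """
    if not -1.0 < r_xy < 1.0:
        raise ValueError("intercorrelation must lie strictly between -1 and 1")
    return ((r_xx + r_yy) / 2.0 - r_xy) / (1.0 - r_xy)


# Hypothetical coefficients, chosen only to show the direction of the effect.
R_XY = 0.80            # intercorrelation, which includes day-to-day variation
KR20 = 0.90            # internal-consistency reliability (excludes that variation)
PARALLEL_FORMS = 0.85  # a lower, parallel-forms-type reliability

with_kr20 = reliability_of_difference(KR20, KR20, R_XY)
with_parallel = reliability_of_difference(PARALLEL_FORMS, PARALLEL_FORMS, R_XY)

print(f"using KR-20 reliabilities:          {with_kr20:.2f}")
print(f"using parallel-forms reliabilities: {with_parallel:.2f}")
```

Under these assumed values the reliability of the difference drops from about .50 to about .25 when the lower reliabilities are substituted, which is the direction of the effect the text describes.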

SUMMARY OF THE RELATIONSHIPS BETWEEN THE IOWA TESTS OF 
BASIC SKILLS AND THE STANFORD DIAGNOSTIC MATHEMATICS TEST 

The study of the relationships between the Iowa Tests of Basic 
Skills and the Stanford Diagnostic Mathematics Test was carried out on 
two levels. First, a structural match between the two test batteries 
was undertaken, and second, a correlational study of the intercorrela- 
tions of "like" subtests and the reliabilities of differences, invol- 
ving 288 fifth- and 260 sixth-grade students, was conducted. 

The structural analysis led to conclusions that a number of items 
appearing in the Applications subtest of the Stanford were, in fact, 
according to the Iowa skills classification, measuring mathematics 
concepts. Additionally, there were graphs and tables items in the 
Applications test that corresponded to items from the Work-Study Skills 
area of the Iowa Tests, and computation items that were considered to 
be measuring concepts as defined for the Iowa Tests. In general, 
however, almost all (99 percent) of the items on the Stanford Diagnostic 
Mathematics Test were found to have equivalent counterparts on the 
Iowa Tests of Basic Skills. The conclusion drawn was that the two 
test batteries measure essentially the same mathematics skills. 

The study of intercorrelations and reliabilities of differences, 
however, did not lead to as clear-cut a conclusion. The intercorrela- 
tions, while creditably high for purposes of looking at construct 
validity, led to reliabilities of differences that were also higher 
than would be expected if the tests were measuring the same attributes 
in the same way. As noted, however, these reliabilities were somewhat 
inflated, since the KR-20 reliabilities and intercorrelations used in 
their computation contained different sources of error variance. 
The level that a reliability of difference must attain to be 
significant is an interpretation problem, however, not a statistical 
problem. For purposes of work with individual student scores, the 
reliabilities of differences could, therefore, be considered to be 
generally low enough to be attributable to measurement errors. Thus, 
the conclusion that the same skills are being measured was tentatively 
supported. 

PART VI: SUMMARY OF THE PROJECT AND CONCLUSIONS 

This project was implemented to study the need for, feasibility of, 
and impact of an interpretation technique designed for use with a 
standardized achievement test. The interpretation technique was 
developed for use with the Iowa Tests of Basic Skills, and the studies 
reported were specific to that test. However, the similarities between 
the reporting systems of the Iowa Tests and other major standardized 
achievement test batteries, and the general principles applied in the 
development of the interpretation technique, make the approach 
generalizable to other tests for which the subskills measured are 
fairly well defined. 

The main focus of the project was to assess the impact of the 
interpretation on students and teachers. This study was reported in 
Part III: The Impact Study. However, two other important questions 
were addressed through the project. The first of these, addressed in 
Part IV: Teachers' Predictions of Student Performance on Subskills of 
Mathematics, dealt with the accuracy of teachers' expectations of 
student performance on the tests. The importance of this study was 
its focus on the commonly held belief that the subjective observations 
that teachers make in their day-to-day classroom activities lead to 
the formation of accurate assessments of student skill development. 

The second important question studied, beyond the impact of 
interpretation, was whether the Iowa Tests of Basic Skills, Mathematics 
subtests, were comparable in design and function to a widely recognized 
"diagnostic" mathematics test. This study, reported in Part V: 
Relationships Between the Results of the Iowa Tests of Basic Skills, 
Mathematics Subtests, and the Stanford Diagnostic Mathematics Test, 
addressed the feasibility of the diagnostic interpretation of the 
"survey" test results. This study was important in establishing or 
refuting the basic premise upon which the interpretation technique 
was developed. 

The findings of the three studies incorporated into this project 
lead to a conclusion that there is a need for the interpretation of 
the results of tests administered in school-wide testing programs. 
At least two bases for this conclusion were found. First, students 
who have been through an interpretation process feel that they have 
done better on the test than students who have not had the test 
results interpreted to them. Presumably, it can be inferred from this 
finding that students will then feel "better" about themselves and 
their skill development. Second, the act of interpretation should 
raise important questions for the teachers as discrepancies between 
expectations and actual performance occur. This should benefit both 
students and teachers, as reasons for the discrepancies between the 
students' behaviors and the teachers' expectations are explained. The 
benefit for teachers should be an opportunity to: 1) reassess their 
expectations for certain students; and 2) examine some of their biases 
about the performance of certain subgroups in the subject areas tested. 
The benefit for students should be a better educational process born 
of better expectations for themselves and more appropriate expectations 
from their teachers, regardless of the student's sex or 
overall achievement level. 

There was modest support for providing "diagnostic" interpretation 
of the "survey" test. This support came through the study of rela- 
tionships between the Iowa Tests and the Stanford Diagnostic Mathema- 
tics Test. One weakness of this study may be found in the definition 
of a "diagnostic" test, and in this case, the Stanford was used 
primarily because it is promoted as a diagnostic instrument. The 
diagnostic use of survey tests, previously challenged among testing 
professionals but often practiced among teachers and counselors, is 
still the source of the most serious cautions about the interpretation 
process presented. 

The problem that arises in the "diagnostic" interpretation of 
tests like the Iowa Tests of Basic Skills is that highly related 
subskills become the focus of attention in the interpretation. The 
subskills of mathematics, for example, yield high intercorrelations 
in part because the student who achieves well in one area of mathema- 
tics is likely to also achieve well in other mathematics skills. 
These high intercorrelations lead to low statistical reliabilities of 
difference between the subskills. This means that profiles of scores, 
observed at one testing period, may not be stable if the test is 
administered again. In these instances, it can be argued that a 
measure on one skill is indicative of the student's ability on the 
other, and any difference in the student's profile is attributable to 
measurement error. This statistical argument ignores the qualitative 
differences between the sets of items, but it is, nonetheless, an 
important consideration in the use of an interpretation technique such 
as the one studied here. 

This area of profile analysis on achievement test results is one 
that deserves a great deal more attention than it has received. Most 
of the existing research in the area has been done in reading 
comprehension, under some fairly restrictive assumptions about what 
readers are like. The area could benefit from studies whose 
interpretation procedures more closely resemble those that occur in 
practice, and from extension of the investigations to other subject 
areas represented on tests. 

Another important caution regarding this interpretation technique 
is that it sets the test results into fairly concrete, easy-to- 
understand terms (i.e., the raw scores for the subskills tested). 
While this approach demystifies the test interpretation process to 
some extent, it also could lead to overconfidence in, or 
overinterpretation of, the scores. It is important for users to keep 
these performances in perspective, just as they should any other test 
score. Although this caution is an important one, the results of the 
impact study suggest that this may be an unfounded concern about the 
process. Very few short-term changes in attitudes about the test or 
its uses were shown to be related to the interpretation process. 

In summary, three studies were conducted in relation to an inter- 
pretation process designed to actively involve students in the inter- 
pretation of their performance on a standardized achievement test. The 
studies provided support for the need, the feasibility, and a few 
important outcomes of the interpretation technique. One of the 
important, but unstudied, underlying aspects of this interpretation 
technique is that the student becomes an active, rather than a 
passive, recipient of test results. The impact of these two different 
approaches to providing test results is another area for future study. 

APPENDIX A: TEACHER SURVEY 



GENERAL DIRECTIONS: This survey consists of several types of questions to 
assess your opinions and knowledge of the Iowa Tests of Basic Skills. 
Please mark your answers directly on this survey. Specific directions 
are given with each type of question. 

DIRECTIONS: Below is a list of possible uses for Iowa Tests of Basic Skills 
results. Using the following code, please give your opinion on the value 
of each use by checking (✓ or x) the appropriate column. 



1= Extremely valuable for this use. 
2= Very valuable for this use, the same use could be met in other ways 
   only with great difficulty. 
3= Valuable, the same use could be met in other ways with some effort. 
4= Somewhat valuable, the same use could be met in other ways without 
   much difficulty. 
5= Minimally valuable, the test results are useful for "added information" 
   but not for meeting the objective. 
6= Not valuable, the test results create problems or detract from other 
   information that could be better used to meet the objective. 





Columns: Extremely Valuable | Very Valuable | Valuable | Somewhat Valuable | Minimally Valuable | Not Valuable 

 1. Reporting to local news media 
 2. Reporting to boards of education 
 3. Reporting to parents 
 4. Screening of special education students 
 5. Planning instruction for individual students 
 6. Planning instruction for groups of students 
 7. Comparing individual scores with performance of a state or national 
    peer group 
 8. Evaluating specific teaching procedures or methods 
 9. Comparing classes within a school 
10. Measuring individual growth from year to year 
11. Identifying system-wide strengths and weaknesses 
12. Identifying individual strengths and weaknesses 
13. Grouping students for specific instruction 

DIRECTIONS: Please answer the following opinion questions by putting the number 
of your answer in the blank at the right of each question. 

14. How relevant are the results of the ITBS to your work with students? 
(1) Not at all relevant (2) Not very relevant (3) Somewhat relevant 
(4) Very relevant (5) Extremely relevant 

15. How useful are the results of the ITBS in identifying strong or weak 
points in the curriculum? (1) Not at all useful (2) Minimally useful 
(3) Useful to some extent (4) Useful to a great extent (5) Useful to 
a very great extent 

16. How useful are the results of the ITBS in discussing future instructional 
plans with individual students? (1) Not at all useful (2) Minimally 
useful (3) Useful to some extent (4) Useful to a great extent (5) Useful 
to a very great extent 

17. How closely do the skills tested on the ITBS match the skills in the 
curriculum you actually teach? (1) Very high match (2) High match 
(3) Medium match (4) Low match (5) Very low match 

18. To what extent do you think the results of the ITBS can be used for 
improving students' understanding of their specific strengths and 
weaknesses? (1) Not at all (2) To a minimal extent (3) To some extent 
(4) To a great extent (5) To a very great extent 

19. How useful are the results of the ITBS in helping parents better 
understand the strengths and limitations of their child? (1) Not at all 
useful (2) Minimally useful (3) Useful to some extent (4) Useful to 
a great extent (5) Useful to a very great extent 

20. How well informed do you consider yourself to be about the ITBS? 
(1) Not informed (2) Minimally informed (3) Informed (4) Well informed 
(5) Extremely well informed 

21. How would you rank the overall quality of the ITBS as compared to other 
standardized tests of its type? (1) One of the best (2) Above average 
(3) About the same as others (4) Below average (5) One of the worst 



DIRECTIONS: This next set of multiple choice questions assesses your 
knowledge about the ITBS. There is one best answer for each. Please 
put the number of your answer in the right hand blank. If you are not 
sure of an answer, take a guess. 

22. In attempting to determine whether or not the Iowa Tests of Basic 
Skills is appropriate for your school system, what is the most 
important issue to consider? 

(1) Does the test battery have sufficiently high reliability? 
(2) Do the test items of the battery correspond to the content 
    of instruction in your school system? 
(3) Is the test battery based on a thorough survey of teaching 
    practices over the whole country? 
(4) Is the student population upon which the test norms are 
    based comparable to the student population of your school 
    system? 

23. Which of the following greatly adds to the reliability of the 
ITBS results? 

(1) The homogeneity of the group tested. 
(2) The number of types of items on the tests. 
(3) The number of persons in the norming population. 
(4) The length of the test battery. 

24. The most serious criticism of the ITBS involves the 

(1) Unwise uses made of test results. 
(2) Inappropriateness of this test in measuring what is being 
    taught in schools today. 
(3) Inappropriateness of comparing the scores of urban students 
    to those of rural students. 
(4) The relatively low level of accuracy of test procedures. 

25. Skills analysis is least useful for 

(1) Planning instruction for groups of students. 
(2) Identifying individual strengths and weaknesses. 
(3) Measuring individual growth from year to year. 
(4) Identifying general class wide strengths and weaknesses. 

26. Which of the following is the biggest problem in interpreting the 
subskills of the ITBS: 

(1) The interpretation process is confusing for many students. 
(2) The low achieving students are not able to identify any 
    strong areas. 
(3) The interpretation usually doesn't provide useful information 
    about average students. 
(4) The differences between subskills can be over-emphasized. 

27. When doing a skills interpretation of the ITBS, it is most appropriate 
that the results be viewed as: 

(1) Valid measures of ability. 
(2) Accurate measures of a student's progress in the subskills. 
(3) Tentative indicators of strength and weakness. 
(4) Definite guides to remediating weaknesses and capitalizing 
    on strengths. 



DIRECTIONS: There is a best answer for each of the following True/False 
items. Please use "1" for TRUE and "2" for FALSE. Again, 
if you are not sure, please guess. 

1 = TRUE     2 = FALSE 

28. The ITBS show students' achievement in some school subjects that 
are important for future school success. 

29. A good use of the ITBS is to give grades at the end of each quarter 
or semester. 

30. The reading test of the ITBS measures three kinds of understanding: 
Facts, Inferences, and Generalizations. 

31. The ITBS measure all of the skills most students are taught in school. 

32. If a student misses most or all of the questions about some skill 
tested, it means that she or he does not know anything about that 
skill. 

33. If a student answers all the questions about a skill correctly, 
it means he/she has mastered that skill. 

34. One of the main purposes of the ITBS is to help students under- 
stand what their strengths and weaknesses are. 

35. The questions for each skill tested on the ITBS are all of about 
the same difficulty. 

APPENDIX B: STUDENT SURVEY 



DIRECTIONS: The following ten questions ask for your opinions about tests. 
There are no right or wrong answers. Please answer the questions 
using the purple Standard Answer Sheet by filling in the space 
below the appropriate letter. If you have any questions, raise 
your hand. 



1. How well do you think you did on the Iowa Tests of Basic Skills this year? 

   A. Quite high 
   B. Above average 
   C. Average 
   D. Below average 
   E. Quite low 

2. In general, how do you feel about tests that teachers make up? 

   A. I really like them 
   B. I like them 
   C. I don't care one way or the other 
   D. I hate them 
   E. I really hate them 

3. In general, how do you feel about the Iowa Tests of Basic Skills? 

   A. I really like them 
   B. I like them 
   C. I don't care one way or the other 
   D. I hate them 
   E. I really hate them 

4. How hard do you think tests like the Iowa Tests of Basic Skills are? 

   A. Very hard 
   B. Hard 
   C. Medium 
   D. Easy 
   E. Very easy 

5. How hard are the tests your teacher makes up? 

   A. Very hard 
   B. Hard 
   C. Medium 
   D. Easy 
   E. Very easy 

6. How nervous do you feel before you take a test that your teacher made up? 

   A. Extremely nervous 
   B. Very nervous 
   C. Nervous 
   D. Just a little nervous 
   E. Not at all nervous 

7. How nervous do you feel before you take a test like the Iowa Tests of Basic Skills? 

   A. Extremely nervous 
   B. Very nervous 
   C. Nervous 
   D. Just a little nervous 
   E. Not at all nervous 

8. How many questions on the Iowa Tests of Basic Skills cover things you have studied in 
school? 

   A. All of them 
   B. Most of them 
   C. Some of them 
   D. Only a few of them 
   E. None of them 

9. How much do you think you know about the tests on the Iowa Tests of Basic Skills? 

   A. A lot 
   B. Quite a bit 
   C. A little 
   D. Not much at all 
   E. Nothing 

10. How useful are the Iowa Tests of Basic Skills results to you? 

   A. Extremely useful 
   B. Very useful 
   C. Useful 
   D. Not useful 
   E. Not at all useful 

DIRECTIONS: The next set of questions tests your knowledge about the Iowa Tests of Basic 
Skills. They do have a right or wrong answer. The sentences below are 
either true or false. If you think a sentence is true mark an "A" on your 
answer sheet. If you think it is false mark a "B". If you are not sure, 
take a guess. 

A = TRUE     B = FALSE 

11. Scores from the Iowa Tests of Basic Skills are most often used to find out what sub- 
jects, like science or reading or math, students are interested in. 

12. The Iowa Tests of Basic Skills show how well students do in some school subjects that 
are important for future school success. 

13. The Iowa Tests of Basic Skills tell how well students work together in groups. 

14. The Iowa Tests of Basic Skills tell how students feel about the school subjects 
they study. 

15. A good use of the Iowa Tests of Basic Skills is to give grades at the end of each 
quarter or semester. 

16. Scores from the Iowa Tests of Basic Skills should almost always be used to help in 
making plans for students' future study. 

17. The Iowa Tests of Basic Skills covers skills in math and reading but NOT language. 

18. The reading test of the Iowa Tests of Basic Skills measures three kinds of under- 
standing: Facts, Inferences, and Generalizations. 

19. The Iowa Tests of Basic Skills test all of the skills most students are taught in school. 

20. If a student misses most or all of the questions about some skill tested, it means 
that she or he does not know anything about that skill. 

21. If a student answers all the questions about a skill correctly, it means he/she 
knows all the important things about that skill. 

22. One of the main purposes of the Iowa Tests of Basic Skills is to help students 
understand what their strengths and weaknesses are. 

23. A student should always do his/her best to answer the questions on the test in order 
for the test to be most useful. 

24. The questions for each skill tested on the Iowa Tests of Basic Skills are all of 
about the same difficulty. 



11. F     12. T     13. F 
14. F     15. F     16. T 
17. F     18. T     19. F 
20. F     21. F     22. T 
23. T     24. F 